In this post I look at how you can split a repository in two, creating a "history" repository, and a current repository, while retaining the ability to temporarily merge them again when required. This is made possible by the git-replace tool.
Background: when repositories get too big
Have you ever cloned the .NET runtime git repository? If you have, you'll know it takes a looooong time to clone. That's not surprising when you see the size of it:
> git clone https://github.com/dotnet/runtime
Cloning into 'runtime'...
remote: Enumerating objects: 1933225, done.
remote: Counting objects: 100% (44/44), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 1933225 (delta 4), reused 15 (delta 1), pack-reused 1933181
Receiving objects: 100% (1933225/1933225), 631.19 MiB | 24.13 MiB/s, done.
Resolving deltas: 100% (1409395/1409395), done.
Updating files: 100% (59281/59281), done.
Doing a fresh clone, you need to download 1.9 million objects, equating to 631MiB. That's…quite a lot. Most of that size is taken up by the history of 117,251 commits, as opposed to the sheer number of files. That history is a large part of where git gets its power, so you don't generally want to lose it, but there are times when it makes sense.
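If you want a rough idea of whether history or file count dominates in your own repository, a few built-in git commands can help. The following is a minimal sketch that runs them against a throwaway three-commit repository; the demo identity, the temp directory, and the variable names are assumptions for illustration only. In a real clone you would run just the last three commands.

```shell
#!/bin/sh
set -eu
# Hypothetical identity so commits work on a machine without git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Throwaway repository with three commits touching one file
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
for i in 1 2 3; do
  echo "change $i" >> file.txt
  git add file.txt
  git commit -q -m "Commit $i"
done

# Number of commits reachable from HEAD
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count"

# Number of files in the current snapshot
file_count=$(git ls-files | wc -l | tr -d ' ')
echo "files: $file_count"

# Object counts and on-disk size (loose and packed), human readable
git count-objects -vH
```

A large commit count combined with a modest file count suggests that most of the download is history rather than the current snapshot.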
I don't generally deal with repos that big, but I have had to work on some sizeable repositories, and the sad truth is that most of the time, you really don't need that history from 5 years ago. Recent commits carry a lot of value, and the older the commits, the lower the value.
We contemplated starting "fresh" by creating a new git repo, copying all the files across, and archiving the old repository. This has some pros and cons:
- Pro: git clone becomes much faster as the repository is much smaller without the history. Taking the .NET runtime repo as an example, it would go from 631MiB to ~90MiB.
- Pro: You can still see the project history prior to the change in the old repository.
- Con: git blame becomes less useful, as you can't see any history prior to the first commit.
- Con: If you need information from both the new and old repository, that's hard to work with.
We decided against it in the end, as the benefits just weren't good enough to justify the inconvenience when you do need the history.
However, I found out recently that we could have had our cake and eaten it by using git replace. For the rest of the post I'll explore how we can achieve the best of both worlds:
- A small "current" repository containing a minimal amount of history
- A "history" repository containing all the commits prior to the creation of the new repository
- The ability to merge them together when required, grafting the history onto our "current" repository!
1. Creating the history repository
For convenience, I'm going to demonstrate the general approach using one of my smaller repositories, StronglyTypedId, but you would really only want to do this with repositories that have become too large and unwieldy. You also need to be aware that this requires some rewriting (rebasing) of commits, so be careful!
This post is based on the Git Tools - Replace entry from the git book - it's pretty much the same thing, I've just worked through it to make sure I understand!
We'll start by cloning the original repository into a folder:
git clone https://github.com/andrewlock/StronglyTypedId
cd StronglyTypedId
From the command line, the history looks like the following (git log --oneline):
0a698f1 (HEAD -> main, tag: v1.0.0-beta06, origin/main, origin/HEAD) Bump version to beta06
0a1a180 Add "#pragma warning disable 1591" to generated code
590f4d3 Add support for EF Core global conversions
5242f13 Add support for NewId package (#52)
b0eb121 fixes #44
9de5ea0 (tag: v1.0.0-beta05) Merge pull request #42 from andrewlock/single-package-attempt-2
f101d65 Remove unneccessary constant
...
Obviously it goes on beyond that, but that's sufficient for now. Visually, these commits are linearly ordered with each pointing to the parent commit:
Let's say we want to "archive" the history up to and including the "Add support for EF Core global conversions" commit (590f4d3). First, we would create a new branch called history at this commit:
git branch history 590f4d3
So our git commits look like this:
Now we can create our "project-history" repository by pushing this branch to a new remote git repository.
In practice you'd create a new git repository on GitHub (for example) and push to that. Instead, for simplicity, I'm going to push to a local folder for now, and treat that as a remote.
# Initialize the "remote" repository
git init C:\repos\git-replace\history
# Add the remote repository to original clone
git remote add project-history C:\repos\git-replace\history
# Push the history branch to the remote repository
git push project-history history:main
After pushing to the remote repository, our original clone looks something like this:
While the remote repository contains only the commits up to and including "Add support for EF Core global conversions" (590f4d3):
That's our "project-history" repository complete.
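To try the steps above without touching a real project, here is a minimal sketch that runs them end to end on a throwaway five-commit repository. The commit messages, paths, and variable names are hypothetical, and I've assumed a bare repository (git init --bare) as the stand-in remote, since it has no working tree to conflict with the push.

```shell
#!/bin/sh
set -eu
# Hypothetical identity so commits work on a machine without git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/original"            # stands in for the original clone
git init -q --bare "$work/history.git"  # stands in for the "project-history" remote

cd "$work/original"
git symbolic-ref HEAD refs/heads/main   # make sure the branch is called main
for i in 1 2 3 4 5; do
  echo "change $i" >> file.txt
  git add file.txt
  git commit -q -m "Commit $i"
done

# Choose the cut-off commit ("Commit 3") and create the history branch there
cutoff=$(git rev-parse HEAD~2)
git branch history "$cutoff"

# Push the history branch to the remote as its main branch
git remote add project-history "$work/history.git"
git push -q project-history history:main

# The remote should hold exactly the commits up to and including the cut-off
history_count=$(git --git-dir="$work/history.git" rev-list --count main)
history_tip=$(git --git-dir="$work/history.git" rev-parse main)
echo "commits in project-history: $history_count"
```

The original clone still has all five commits at this point; only the remote is limited to the first three.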
2. Squashing history to reduce repository size
So we now have a "project-history" repository, which contains the historic commits for the repository. However, our main clone currently contains all those commits too. We want to squash all those commits together to reduce the size of the repository (losing the commit history in the process).
So that things work correctly, we need to squash everything prior to the latest commit in the project-history repository. That means leaving 590f4d3 untouched, and squashing everything before it.
We can do this using git commit-tree. This is a 'plumbing' command that you won't commonly need to run. It creates a new commit object from an existing tree, essentially taking a "snapshot" of the files as they were at a given commit and creating a new commit containing them.
To help our future-selves, we include some instructions in the commit message of this new commit about how to reconstitute the history. We'll come back to this later.
To create the squashed commit we use the format:
git commit-tree -m <commit message> "<commit>^{tree}"
where <commit> is the commit reference for which we want to generate the tree. For our purposes, we use the parent of the latest commit in the "project-history" repository (590f4d3~ means "the parent of 590f4d3"):
> git commit-tree -m "For historic commits, run 'git replace <child-ID> 590f4d3'" "590f4d3~^{tree}"
d3bee05dac84c66b7d13f99a5edf790688f51494
# The returned value d3bee05 is the commit ID of the new commit
This creates a new "floating" commit (d3bee05), which contains the files exactly as they were at commit 590f4d3~, i.e. the parent commit of 590f4d3. You can think of it as squashing all of the previous commits into one. This will be the source of the "compaction" to make our repository smaller.
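To convince yourself that the floating commit really is a parentless snapshot, you can compare it with the commit it was created from. This is a small sketch on a throwaway repository (the commit messages, paths, and variable names are hypothetical): the new commit should have no ancestors, and its tree should be identical to the tree of the cut-off commit's parent.

```shell
#!/bin/sh
set -eu
# Hypothetical identity so commits work on a machine without git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/original"
cd "$work/original"
git symbolic-ref HEAD refs/heads/main   # make sure the branch is called main
for i in 1 2 3 4 5; do
  echo "change $i" >> file.txt
  git add file.txt
  git commit -q -m "Commit $i"
done

# "Commit 3" plays the role of 590f4d3, the last commit kept in the history repo
cutoff=$(git rev-parse HEAD~2)

# Create the floating commit from the tree of the cut-off commit's parent
base=$(git commit-tree -m "For historic commits, run 'git replace <child-ID> $cutoff'" "$cutoff~^{tree}")
echo "floating commit: $base"

# Only one commit is reachable from it, so it has no parents
reachable=$(git rev-list --count "$base")

# Its tree is the same object as the tree of the parent of the cut-off commit
base_tree=$(git rev-parse "$base^{tree}")
parent_tree=$(git rev-parse "$cutoff~^{tree}")
echo "reachable commits: $reachable"
```

Because both commits point at the same tree object, no file content is duplicated; only a new commit object is written.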
Now we're going to rebase the remaining commits, i.e. everything that wasn't squashed, on top of this "floating" root commit.
Although not necessary in this case, I've added --rebase-merges to keep the "merge commit" structure if you're using merge commits for your PR merges.
git rebase 590f4d3~ --onto d3bee05 --rebase-merges
Visually, we're "moving" the commits onto the new floating root commit:
Note that the "Add support for EF Core global conversions" commit appears as both the last commit in the project-history repository and as the first "real" commit in this repository. This is important, as you'll see later.
If you check the git log using git log --oneline then the whole history looks like this:
546774e (HEAD -> main) Bump version to beta06
0541428 Add "#pragma warning disable 1591" to generated code
92305a9 Add support for EF Core global conversions
d3bee05 For historic commits, run 'git replace <child-ID> 590f4d3'
I haven't truncated the history, that's all there is! If we look at the state of our current clone in gitk you can see that the main branch is based off the floating commit d3bee05, and is now completely separate from the history branch.
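The squash-and-rebase step can be rehearsed on a throwaway repository before you try it on anything important. The sketch below uses hypothetical commit messages, paths, and variable names, with "Commit 3" standing in for 590f4d3. It checks two things: the compacted main branch contains only the floating commit plus the three rebased commits, and the files at the tip are unchanged.

```shell
#!/bin/sh
set -eu
# Hypothetical identity so commits work on a machine without git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/original"
cd "$work/original"
git symbolic-ref HEAD refs/heads/main   # make sure the branch is called main
for i in 1 2 3 4 5; do
  echo "change $i" >> file.txt
  git add file.txt
  git commit -q -m "Commit $i"
done

cutoff=$(git rev-parse HEAD~2)          # "Commit 3", kept in both repositories
git branch history "$cutoff"
old_tip=$(git rev-parse main)

# Floating root commit holding the snapshot just before the cut-off
base=$(git commit-tree -m "Squashed history" "$cutoff~^{tree}")

# Replay everything after the cut-off's parent onto the floating commit
git rebase -q --onto "$base" --rebase-merges "$cutoff~"

new_tip=$(git rev-parse main)
new_count=$(git rev-list --count main)  # floating commit + Commits 3, 4 and 5
old_tree=$(git rev-parse "$old_tip^{tree}")
new_tree=$(git rev-parse "$new_tip^{tree}")
echo "commits on main after compaction: $new_count"
```

The commit IDs change but the tree at the tip does not, which is a handy sanity check that the rebase hasn't altered any files.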
At this point, we're done crafting our new "smaller" repository, so we can push our main branch to the repository, effectively compacting the history. Notice however, that all the commit hash IDs have changed. That's expected, because we've rebased, but it's something you need to take into consideration before continuing!
WARNING, this is rewriting history of the main branch! You probably shouldn't push these changes in a public repo. A (better?) alternative would be to push this to an entirely new repository and archive the original instead.
For this example, I'm going to assume you're working on a "private" repo, so you can force-push to main without the sky falling:
git push origin main --force-with-lease
With that we're basically done. You have a small "current" git repository to use for active development, and a "project-history" repository that contains the remaining history. In the next section we'll look at how to join them using git replace.
3. Joining two repositories with git-replace
The small repository will be quicker and easier to use, but at some point you're going to want to reconstitute the history so you can do a git blame, or something similar. We're going to do that by using git replace.
Let's imagine you're a new user, working in a fresh clone of the new, compacted repo:
git clone https://github.com/andrewlock/StronglyTypedId
From your point of view, you currently only have the commits from the main clone:
But you realise that to understand some of the code, you need to check the history, so you add the "historical" repo as a new remote, and create a history branch to track it.
> git remote add project-history C:\repos\git-replace\history
> git fetch project-history
From C:\repos\git-replace\history
* [new branch] main -> project-history/main
> git branch history project-history/main
branch 'history' set up to track 'project-history/main'.
Now you have recent commits in the main branch and the historical commits in the project-history/main branch, but they're disconnected, so your git tooling (and IDEs) still won't see the "historical" commits.
To connect the two repositories, we're going to use the fact that the "Add support for EF Core global conversions" commit appears in both trees. We essentially "replace" the main version of the commit (92305a9) with the history version of the commit (590f4d3). By doing that the two trees will become connected, and all our existing git tools will work as we would like!
To do the replacement and combine the branches, you need to run git replace <main-id> <historical-id>:
git replace 92305a9 590f4d3
After running this command, the repository now looks something like this:
All our usual git commands (and IDEs) will now think that the parent of 92305a9 is a commit from the project-history repository. For example, git log --oneline shows:
546774e (HEAD -> main, origin/main, origin/HEAD) Bump version to beta06
0541428 Add "#pragma warning disable 1591" to generated code
92305a9 (replaced) Add support for EF Core global conversions
5242f13 Add support for NewId package (#52)
b0eb121 fixes #44
9de5ea0 (tag: v1.0.0-beta05) Merge pull request #42 from andrewlock/single-package-attempt-2
f101d65 Remove unneccessary constant
...
Note that:
- The first commits in main have the same commit IDs. git replace hasn't changed these. This isn't a rebase.
- The commit that was replaced has the same commit ID (from main), but is marked as (replaced).
- All subsequent parent commits from the historical repo are there, with their original commit IDs!
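The points above can be checked on a throwaway repository. In this sketch (hypothetical commit messages, paths, and variable names) I've kept the history branch in the same local repository rather than fetching it from a separate remote, which is a simplification, but the replacement itself works the same way. After git replace, the tip of main keeps its ID, yet walking the history from main reaches the original root commit.

```shell
#!/bin/sh
set -eu
# Hypothetical identity so commits work on a machine without git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/original"
cd "$work/original"
git symbolic-ref HEAD refs/heads/main   # make sure the branch is called main
for i in 1 2 3 4 5; do
  echo "change $i" >> file.txt
  git add file.txt
  git commit -q -m "Commit $i"
done

# Build the compacted main branch and keep the old commits on "history"
cutoff=$(git rev-parse HEAD~2)          # historical version of "Commit 3"
git branch history "$cutoff"
base=$(git commit-tree -m "Squashed history" "$cutoff~^{tree}")
git rebase -q --onto "$base" --rebase-merges "$cutoff~"

tip_before=$(git rev-parse main)
compact_count=$(git rev-list --count main)
main_id=$(git rev-parse main~2)         # rebased version of "Commit 3"

# Graft the history on: git replace <main-id> <historical-id>
git replace "$main_id" "$cutoff"

tip_after=$(git rev-parse main)
joined_count=$(git rev-list --count main)
root=$(git rev-list --max-parents=0 main)
root_subject=$(git log -1 --format=%s "$root")
echo "before: $compact_count commits, after: $joined_count commits"
echo "oldest commit now visible from main: $root_subject"
```

The floating squash commit drops out of the walk once the replacement is in place, because the replaced commit now reports the historical commit's parent instead.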
After doing what you need to do, you can revert all of these changes by running:
# Delete the replacement reference
git replace -d 92305a9
# Delete the history branch
git branch -d history
# Remove the remote repository
git remote remove project-history
Now you'll be back to a nice small repository again!
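If you only want to hide the grafted history for a moment, deleting the replacement isn't strictly required: git has a --no-replace-objects option that ignores replacement refs for a single command. The sketch below, again on a throwaway repository with hypothetical names, lists the active replacement, compares the commit count with and without it, and then removes it.

```shell
#!/bin/sh
set -eu
# Hypothetical identity so commits work on a machine without git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/original"
cd "$work/original"
git symbolic-ref HEAD refs/heads/main   # make sure the branch is called main
for i in 1 2 3 4 5; do
  echo "change $i" >> file.txt
  git add file.txt
  git commit -q -m "Commit $i"
done

# Compact main, keep the old commits on "history", then graft them back on
cutoff=$(git rev-parse HEAD~2)
git branch history "$cutoff"
base=$(git commit-tree -m "Squashed history" "$cutoff~^{tree}")
git rebase -q --onto "$base" --rebase-merges "$cutoff~"
main_id=$(git rev-parse main~2)
git replace "$main_id" "$cutoff"

# List the commits that currently have a replacement
replaced=$(git replace -l)
echo "replaced: $replaced"

# Same branch, viewed with and without the replacement applied
joined_count=$(git rev-list --count main)
raw_count=$(git --no-replace-objects rev-list --count main)
echo "with replacement: $joined_count, without: $raw_count"

# Remove the replacement permanently
git replace -d "$main_id"
remaining=$(git replace -l)
after_count=$(git rev-list --count main)
```

An empty git replace -l after the cleanup confirms that the compact view is the only view again.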
Summary
In this post I explored a use-case for the git replace command. If you have a very large repository, at some point you might find it useful to "archive" the history, by pushing it to a separate git repo, and creating a "compacted" repo for future development. The downside to this approach is it makes running commands like git blame or git log problematic. You can work around this limitation by using git replace to graft one commit tree onto another.