Channel: Andrew Lock | .NET Escapades

Reducing the size of a git repository with git-replace

In this post I look at how you can split a repository in two, creating a "history" repository, and a current repository, while retaining the ability to temporarily merge them again when required. This is made possible by the git-replace tool.

Background: when repositories get too big

Have you ever cloned the .NET runtime git repository? If you have, you'll know it takes a looooong time to clone. That's not surprising when you see the size of it:

> git clone https://github.com/dotnet/runtime
Cloning into 'runtime'...
remote: Enumerating objects: 1933225, done.
remote: Counting objects: 100% (44/44), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 1933225 (delta 4), reused 15 (delta 1), pack-reused 1933181
Receiving objects: 100% (1933225/1933225), 631.19 MiB | 24.13 MiB/s, done.
Resolving deltas: 100% (1409395/1409395), done.
Updating files: 100% (59281/59281), done.

Doing a fresh clone, you need to download 1.9 million objects, equating to 631 MiB. That's…quite a lot. Most of that size is taken up by the history of 117,251 commits, as opposed to the sheer number of files. That history is a large part of where git gets its power, so you don't generally want to lose it, but there are times when trimming it makes sense.
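If you want to see similar numbers for one of your own clones, the commands below are a rough sketch of how to get them. The throwaway repository, the notes.txt file, and the variable names are illustrative assumptions so the snippet can run anywhere; in practice you would run the last three commands inside your existing clone.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a throwaway repository so the commands have something to measure.
repo="$(mktemp -d)/size-demo"
git init -q "$repo"
cd "$repo"
for i in 1 2 3; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "commit $i"
done

# Number of commits reachable from the current branch.
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count"

# Number of tracked files in the latest snapshot.
file_count=$(git ls-files | wc -l | tr -d ' ')
echo "files: $file_count"

# Object counts and on-disk size (loose and packed), human readable.
git count-objects -vH
```

A large commit count next to a modest file count is a hint that the history, rather than the working tree, is what makes the clone slow.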

I don't generally deal with repos that big, but I have had to work on some sizeable repositories, and the sad truth is that most of the time, you really don't need that history from 5 years ago. Recent commits carry a lot of value, and the older the commits, the lower the value.

We contemplated starting "fresh" by creating a new git repo, copying all the files across, and archiving the old repository. This has some pros and cons:

  • Pro: git clone becomes much faster as the repository is much smaller without the history. Taking the .NET runtime repo as an example, it would go from 631 MiB to ~90 MiB.
  • Pro: You can still see the project history prior to the change in the old repository.
  • Con: git blame becomes less useful, as you can't see any history prior to the first commit.
  • Con: If you need information from both the new and old repository, that's hard to work with.

We decided against it in the end, as the benefits just weren't good enough to justify the inconvenience when you do need the history.

However, I found out recently that we could have had our cake and eaten it by using git replace. For the rest of the post I'll explore how we can achieve the best of both worlds:

  • A small "current" repository containing a minimal amount of history
  • A "history" repository containing all the commits prior to the creation of the new repository
  • The ability to merge them together when required, grafting the history onto our "current" repository!

1. Creating the history repository

For convenience, I'm going to demonstrate the general approach using one of my smaller repositories, StronglyTypedId, but you would really only want to do this with repositories that have become too large and unwieldy. You also need to be aware that this requires some rewriting (rebasing) of commits, so be careful!

This post is based on the Git Tools - Replace entry from the git book - it's pretty much the same thing, I've just worked through it to make sure I understand!

We'll start by cloning the original repository into a folder:

git clone https://github.com/andrewlock/StronglyTypedId
cd StronglyTypedId

From the command line, the history looks like the following (using git log --oneline):

0a698f1 (HEAD -> main, tag: v1.0.0-beta06, origin/main, origin/HEAD) Bump version to beta06
0a1a180 Add "#pragma warning disable 1591" to generated code
590f4d3 Add support for EF Core global conversions
5242f13 Add support for NewId package (#52)
b0eb121 fixes #44
9de5ea0 (tag: v1.0.0-beta05) Merge pull request #42 from andrewlock/single-package-attempt-2
f101d65 Remove unneccessary constant
...

Obviously it goes on beyond that, but that's sufficient for now. Visually, these commits are linearly ordered with each pointing to the parent commit:

Visual example of the repo

Let's say we want to "archive" the history from the "Add support for EF Core global conversions" commit (590f4d3). First, we create a new branch called history at this commit:

git branch history 590f4d3

So our git commits look like this:

Visual example of the repo after creating the history commit

Now we can create our "project-history" repository by pushing this branch to a new remote git repository.

In practice you'd create a new git repository on GitHub (for example) and push to that. Instead, for simplicity, I'm going to push to a local folder for now, and treat that as a remote:

# Initialize the "remote" repository
git init C:\repos\git-replace\history

# Add the remote repository to original clone
git remote add project-history C:\repos\git-replace\history

# Push the history branch to the remote repository
git push project-history history:main

After pushing to the remote repository, our original clone looks something like this:

Visual example of the main repository in gitk

While the remote repository contains only the commits up to and including "Add support for EF Core global conversions" (590f4d3):

Visual example of the project-history repository

That's our "project-history" repository complete.
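As a quick sanity check of this step, the sketch below repeats it on a throwaway repository and confirms that the "remote" ends at the chosen commit. The five toy commits, the notes.txt file, and the use of HEAD~2 as a stand-in for 590f4d3 are assumptions for illustration. It also uses git init --bare for the stand-in remote, which is a choice made here rather than what the commands above do.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Toy repository with five linear commits on main.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/main
for i in 1 2 3 4 5; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "commit $i"
done

# HEAD~2 plays the role of 590f4d3: the last commit to archive.
cut=$(git rev-parse HEAD~2)
git branch history "$cut"

# Assumption: a bare repository acts as the local stand-in remote.
git init -q --bare "$work/history.git"
git remote add project-history "$work/history.git"
git push -q project-history history:main

# The remote's main should point at the cut commit and hold
# only the commits up to and including it.
remote_tip=$(git --git-dir="$work/history.git" rev-parse refs/heads/main)
remote_count=$(git --git-dir="$work/history.git" rev-list --count refs/heads/main)
echo "remote tip: $remote_tip ($remote_count commits)"
```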

2. Squashing history to reduce repository size

So we now have a "project-history" repository, which contains the historic commits for the repository. However, our main clone still contains all those commits too. We want to squash those commits together to reduce the size of the repository (losing the commit history in the process).

So that things work correctly, we need to squash everything prior to the latest commit in the project-history repository. So we need to leave 590f4d3 untouched, and squash everything prior to this commit.

We can do this using git commit-tree, a 'plumbing' command that you won't commonly need to run. It essentially takes the "snapshot" of files (the tree) recorded at a given commit, and creates a new commit containing those files.

To help our future-selves, we include some instructions in the commit message of this new commit about how to reconstitute the history. We'll come back to this later.

To create the squashed commit we use the format:

git commit-tree -m <commit message> "<commit>^{tree}"

Where <commit> is the commit reference for which we want to generate the tree. For our purposes, we use the parent of the latest commit in the "project-history" repository (590f4d3~ means "the parent of 590f4d3"):

> git commit-tree -m "For historic commits, run 'git replace <child-ID> 590f4d3'" "590f4d3~^{tree}"
d3bee05dac84c66b7d13f99a5edf790688f51494
# The returned value d3bee05 is the commit ID of the new commit

This creates a new "floating" commit (d3bee05) with no parents, which contains the files as they were at commit 590f4d3~, i.e. the parent commit of 590f4d3. You can think of it as squashing all of the previous commits into one. This will be the source of the "compaction" to make our repository smaller.

Visual example of the floating commit
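To convince yourself of what git commit-tree produced, you can check that the new commit has no parents and that its tree is identical to the tree of the original commit. The following is a minimal sketch on a throwaway repository; the toy commits, the commit message, and HEAD~2 standing in for 590f4d3 are assumptions for illustration.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Toy repository with five linear commits on main.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/main
for i in 1 2 3 4 5; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "commit $i"
done

# HEAD~2 plays the role of 590f4d3.
cut=$(git rev-parse HEAD~2)

# Create a new commit from the tree of the parent of $cut.
# No -p option is given, so the new commit has no parents.
root=$(git commit-tree -m "Squashed history before $cut" "$cut~^{tree}")

# --parents prints "<commit> <parent>..."; a root commit prints only itself.
parents=$(git rev-list --parents -n 1 "$root")
root_tree=$(git rev-parse "$root^{tree}")
orig_tree=$(git rev-parse "$cut~^{tree}")
echo "new root commit: $root"
echo "tree of new root: $root_tree"
echo "tree of original: $orig_tree"
```

Because both commits point at the same tree object, no file content is duplicated; only the chain of older commits is left behind.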

Now we're going to rebase the remaining commits (everything that wasn't squashed) on top of this "floating" root commit.

Although not necessary in this case, I've added --rebase-merges to keep the "merge commit" structure if you're using merge commits for your PR merges.

git rebase 590f4d3~ --onto d3bee05 --rebase-merges

Visually, we're "moving" the commits onto the new floating root commit:

Visual example of the result of the rebase

Note that the "Add support for EF Core global conversions" commit appears as both the last commit in the project-history repository and as the first "real" commit in this repository. This is important, as you'll see later.
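The rebase only changes where the commits hang, not the files they contain. The sketch below runs the same two steps on a throwaway repository and checks that the final snapshot is unchanged while the history shrinks. The toy commits and HEAD~2 as a stand-in for 590f4d3 are assumptions for illustration, and --rebase-merges needs a reasonably recent git.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Toy repository with five linear commits on main.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/main
for i in 1 2 3 4 5; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "commit $i"
done

# HEAD~2 plays the role of 590f4d3.
cut=$(git rev-parse HEAD~2)
tree_before=$(git rev-parse "HEAD^{tree}")

# Squash everything before $cut into one parentless commit.
root=$(git commit-tree -m "Squashed history" "$cut~^{tree}")

# Replay $cut and everything after it onto the new root.
git rebase -q --onto "$root" "$cut~" --rebase-merges

tree_after=$(git rev-parse "HEAD^{tree}")
count_after=$(git rev-list --count HEAD)
oldest=$(git rev-list --max-parents=0 HEAD)
git log --oneline
```

Here five commits become four: the new root plus the three commits that were replayed.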

If you check the git log using git log --oneline then the whole history looks like this:

546774e (HEAD -> main) Bump version to beta06
0541428 Add "#pragma warning disable 1591" to generated code
92305a9 Add support for EF Core global conversions
d3bee05 For historic commits, run 'git replace <child-ID> 590f4d3'

I haven't truncated the history, that's all there is! If we look at the state of our current clone in gitk you can see that the main branch is based off the floating commit d3bee05, and is now completely separate from the history branch.

Visual example in git view

At this point, we're done crafting our new "smaller" repository, so we can push our main branch to the repository, effectively compacting the history. Notice however, that all the commit hash IDs have changed. That's expected, because we've rebased, but it's something you need to take into consideration before continuing!

WARNING: this rewrites the history of the main branch! You probably shouldn't push these changes in a public repo. A (better?) alternative would be to push this to an entirely new repository and archive the original instead.

For this example, I'm going to assume you're working on a "private" repo, so you can force-push to main without the sky falling:

git push origin main --force-with-lease

With that we're basically done. You have a small "current" git repository to use for active development, and a "project-history" repository that contains the remaining history. In the next section we'll look at how to join them using git replace.

3. Joining two repositories with git-replace

The small repository will be quicker and easier to use, but at some point you're going to want to reconstitute the history so you can do a git blame, or something similar. We're going to do that by using git replace.

Let's imagine you're a new user, working in a fresh clone of the new, compacted repo:

git clone https://github.com/andrewlock/StronglyTypedId

From your point of view, you currently only have the commits from the main clone:

Image of the state of the repo

But you realise that to understand some of the code, you need to check the history, so you add the "historical" repo as a new remote, and create a history branch to track it.

> git remote add project-history C:\repos\git-replace\history
> git fetch project-history
From C:\repos\git-replace\history
 * [new branch]      main     -> project-history/main
> git branch history project-history/main
branch 'history' set up to track 'project-history/main'.

Now you have recent commits in the main branch and the historical commits in the project-history/main branch, but they're disconnected, so your git tooling (and IDEs) still won't see the "historical" commits.

Image of the state of the repo with the main branch disconnected from the history

To connect the two repositories, we're going to use the fact that the "Add support for EF Core global conversions" commit appears in both trees. We essentially "replace" the main version of the commit (92305a9) with the history version of the commit (590f4d3). By doing that the two trees will become connected, and all our existing git tools will work as we would like!

Image of what we're going to do

To do the replacement and combine the branches, run git replace <main-id> <historical-id>:

git replace 92305a9 590f4d3

After running this command, the repository now looks something like this:

Image of what we've got!

All our usual git commands (and IDEs) will now treat 92305a9 as if its parent were a commit from the project-history repository. For example, git log --oneline now shows:

546774e (HEAD -> main, origin/main, origin/HEAD) Bump version to beta06
0541428 Add "#pragma warning disable 1591" to generated code
92305a9 (replaced) Add support for EF Core global conversions
5242f13 Add support for NewId package (#52)
b0eb121 fixes #44
9de5ea0 (tag: v1.0.0-beta05) Merge pull request #42 from andrewlock/single-package-attempt-2
f101d65 Remove unneccessary constant
...

Note that:

  • The most recent commits in main keep the same commit IDs. git replace hasn't changed these. This isn't a rebase.
  • The commit that was replaced has the same commit ID (from main), but is marked as (replaced).
  • All subsequent parent commits from the historical repo are there, with their original commit IDs!
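The whole round trip can be tried safely on a throwaway repository. The sketch below is a simplified version under stated assumptions: the old commits stay reachable through a local history branch instead of being fetched from a separate remote, the commits are toy ones, and HEAD~2 stands in for the shared commit. It also shows git --no-replace-objects, a global git option that ignores replacements for a single command.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Toy repository with five linear commits on main.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/main
for i in 1 2 3 4 5; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "commit $i"
done

# Archive and compact: keep the old commits on a local history branch.
cut=$(git rev-parse HEAD~2)
git branch history "$cut"
root=$(git commit-tree -m "Squashed history" "$cut~^{tree}")
git rebase -q --onto "$root" "$cut~" --rebase-merges

# The rebased copy of $cut is now two commits below HEAD.
new_cut=$(git rev-parse HEAD~2)
before=$(git rev-list --count HEAD)

# Graft: make git read $cut whenever it is asked for $new_cut.
git replace "$new_cut" "$cut"
after=$(git rev-list --count HEAD)

# Replacements can be ignored for one command without deleting them.
ignored=$(git --no-replace-objects rev-list --count HEAD)

# Undo the graft.
git replace -d "$new_cut" > /dev/null
restored=$(git rev-list --count HEAD)
echo "before=$before after=$after ignored=$ignored restored=$restored"
```

With the replacement in place the walk from HEAD reaches all five original commits; without it, only the four commits of the compacted branch are visible.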

After doing what you need to do, you can revert all of these changes by running:

# Delete the replacement reference
git replace -d 92305a9
# Delete the history branch
git branch -d history
# Remove the remote repository
git remote remove project-history

Now you'll be back to a nice small repository again!

Summary

In this post I explored a use-case for the git replace command. If you have a very large repository, at some point you might find it useful to "archive" the history, by pushing it to a separate git repo, and creating a "compacted" repo for future development. The downside to this approach is it makes running commands like git blame or git log problematic. You can work around this limitation by using git replace to graft one commit tree onto another.
