Channel: Andrew Lock | .NET Escapades

Rewriting git history simply with git-filter-repo

In this post I describe how I used git-filter-repo to rewrite the history of a git repository to move files into a subfolder.

Background: rewriting git history

As a git user, I like to rebase. I like to make lots of small commits and tidy them up later using interactive rebase, and to rewrite my PRs to make them easier to understand (and review). I use git push origin --force-with-lease so much that I have it aliased as git pof.
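An alias like git pof can be defined with git config. The snippet below is a minimal sketch: it sets the alias in a scratch repository (a temporary directory invented for this example) so that it doesn't touch any real configuration. Swapping --local for --global would make the alias available in every repository.

```shell
# Create a scratch repository so the alias is only defined locally
# (the temporary directory is an assumption made for this example).
scratch_repo="$(mktemp -d)"
git init --quiet "$scratch_repo"

# Define "git pof" as shorthand for a lease-protected force push
git -C "$scratch_repo" config --local alias.pof "push origin --force-with-lease"

# Read the alias back to see what it expands to
pof_alias="$(git -C "$scratch_repo" config --get alias.pof)"
echo "git pof expands to: git $pof_alias"
```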

What I don't do is rewrite the history of my main/master branch. There's a whole world of pain there, as other people will likely have branched from it, and their branches can easily end up in a complete mess.

However, sometimes it makes sense.

I was working on a small side project the other day, when I realised it would really make sense for it to effectively be a "monorepo". So rather than having all the existing code in the root directory, I wanted to move it to a child directory.

So I started with a directory that looked like this:

Directory before the changes

And I wanted a directory that looked like this:

Directory after the changes

The notable points here are:

  • Everything has been moved to an engine subfolder
  • Except the .gitattributes and .gitignore files, which are still at the top level.

The simplest way to do this is to just move all the files and create a new commit with the changes. Job done. The downside is that, while git itself is OK at tracking file moves (though it sometimes gets things wrong), the move can cause other issues.
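To make the trade-off concrete, here is a minimal sketch of the "just move the files" approach in a throwaway repository. The file names, identity, and commit messages are made up for the illustration. It shows that the path-limited history of a moved file stops at the move unless you explicitly ask git to follow renames.

```shell
# Build a throwaway repository with one file at the root
demo_repo="$(mktemp -d)"
git init --quiet "$demo_repo"
git -C "$demo_repo" config user.name "Demo User"
git -C "$demo_repo" config user.email "demo@example.com"

printf 'hello\n' > "$demo_repo/app.txt"
printf 'bin/\n' > "$demo_repo/.gitignore"
git -C "$demo_repo" add app.txt .gitignore
git -C "$demo_repo" commit --quiet -m "Initial commit"

# The "simple" approach: move the code into engine/ in a new commit,
# leaving .gitignore at the top level
mkdir "$demo_repo/engine"
git -C "$demo_repo" mv app.txt engine/app.txt
git -C "$demo_repo" commit --quiet -m "Move code into engine/"

# Count the commits git reports for the file at its new path,
# first without and then with rename tracking
plain_count="$(git -C "$demo_repo" log --oneline -- engine/app.txt | wc -l | tr -d ' ')"
follow_count="$(git -C "$demo_repo" log --oneline --follow -- engine/app.txt | wc -l | tr -d ' ')"
echo "commits without --follow: $plain_count, with --follow: $follow_count"
```

The first count covers only the move commit, while --follow also finds the commit from before the move. Any tool that looks up a file by its current path, without following renames, only sees the former.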

For example, if you're looking at a file on GitHub, and you want to see what it looks like at a particular commit, then you can use the branch selector to change it. However, if the file has moved, you'll get a 404. Not a great experience.

Changing the branch for a file and getting a 404 in GitHub

If the odd file has moved, that's not a big deal, but if literally every file has moved, you'll hit that 404 for every file.

So what's the alternative? Rewriting history!

Rewriting history: the options

With rewriting history, we update the git branches to make it look like all the files were originally committed to the engine subfolder. There's no "sudden move". The history shows them as always having been in the engine folder.

This sort of wholesale rewriting of your main/master branch is definitely not advisable if you are sharing the repo publicly. You will likely break all sorts of people's work!

Normally when I'm rewriting history I use git rebase -i in combination with git reset HEAD~. This lets me squash commits together, pause to split them apart, reorder them, or remove them entirely. That's great for when you're massaging a PR, but it's really not designed for wholesale rewriting of an entire repository.

For those scenarios, git filter-branch is a better option. This is a complex git command that, frankly, scares me. I have used it on occasion, but the syntax is janky, you typically have to incorporate a lot of bash, it's often slow, and you could mess up your whole repository. Yay!

Just take a look at this Stack Overflow question, which is about a similar requirement but in reverse: moving from the engine folder to the root. One of the answers suggests running the following command:

git filter-branch -f --index-filter 'PATHS=`git ls-files -s | sed "s/^engine//"`; \
GIT_INDEX_FILE=$GIT_INDEX_FILE.new; \
echo -n "$PATHS" | \
git update-index --index-info \
&& if [ -e "$GIT_INDEX_FILE.new" ]; \
  then mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"; \
fi' -- --all

That's definitely something. Does it work? Probably. Would you want to write your own? Almost certainly not.

So instead of trying to figure out how to mangle git filter-branch to my liking, I decided to look at a suggestion I saw elsewhere: git-filter-repo.

"Installing" git-filter-repo using Docker

git-filter-repo isn't built into git itself. In fact, it's a single Python file, but it's written to feel like a git plugin. And the really nice thing is that the API is so much nicer. That whole git filter-branch expression in the previous section could be rewritten with git-filter-repo as something like this:

git filter-repo --path-rename engine/:

I think you'll agree that's much clearer! The manual is also very good, with lots of examples.

The only problem, from my point of view, is that git-filter-repo is a Python module. Python on Windows can be problematic (even the install instructions make that clear), and while you can install Python from the Microsoft Store, I really didn't want to go through that. Docker to the rescue!

Docker is such a great use-case for something like this, where I want to quickly try a tool, and don't want to risk messing up my machine. Instead of installing Python, I'll run a Docker image that already has Python installed, map the drive to my project, and work inside the docker image!

git-filter-repo requires Python 3.5+, so I searched for Python on Docker Hub and found the official images. At the time of writing, the python:3 image is based on bullseye (Debian 11) with Python 3.10 installed, which would do nicely.

I ran the following command from inside my app's directory to pull and run the Docker image, map the current directory to the /app directory inside the container, set the working directory to /app, and start a bash shell.

docker run --rm -it -v ${PWD}:/app -w /app python:3 /bin/bash

I now have a running Python container, but I don't have the git-filter-repo tool installed yet. The python:3 image uses Debian 11, and according to the git-filter-repo install instructions, I needed to use the "backports" repository to install via apt-get.

A repository in this context refers to the server containing all the packages used by apt for installation into a Linux machine. It is separate from the concept of a "git repository".

Unfortunately the backports repository isn't enabled by default in Debian 11, so I followed the instructions from the backports website to add it to the sources list, and installed the git-filter-repo package:

# Add the backports repo to sources.list
echo 'deb http://deb.debian.org/debian bullseye-backports main' > /etc/apt/sources.list.d/backports.list

# Update the list of available packages
apt-get update

# Install git-filter-repo, adding the required /bullseye-backports suffix
apt-get install -y git-filter-repo/bullseye-backports

The logs indicated this had installed correctly, so I was ready to take it for a spin!

Using git-filter-repo to move files into a subdirectory

My first attempt to use git-filter-repo wasn't very successful. I tried running:

git filter-repo --to-subdirectory-filter engine/

which seemed like it would do most of what I wanted, but I was presented with the following:

> git filter-repo --to-subdirectory-filter engine/
Aborting: Refusing to destructively overwrite repo history since
this does not look like a fresh clone.
  (expected freshly packed repo)
Please operate on a fresh clone instead.  If you want to proceed
anyway, use --force.

This is very interesting! Rewriting history is obviously a very destructive process in which you can lose work, and git-filter-repo is doing its best to make sure you don't hurt yourself. As long as you have your work pushed to a remote git repository you should be fine, but to be safe, git-filter-repo requires you to work in a fresh clone by default.
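If the copy you want to rewrite only exists on disk, a fresh clone can also be made from a local path. The sketch below builds a tiny source repository (purely illustrative, as are the directory and file names) and clones it with --no-local. That flag makes git use its normal transport instead of copying or hard-linking the object files, and git-filter-repo reportedly wants it for local clones so that the result passes its fresh-clone check. Treat that last point as an assumption and check it against the manual for your version.

```shell
# Build a small source repository to clone from (illustrative only)
source_repo="$(mktemp -d)"
git init --quiet "$source_repo"
git -C "$source_repo" config user.name "Demo User"
git -C "$source_repo" config user.email "demo@example.com"
printf 'hello\n' > "$source_repo/app.txt"
git -C "$source_repo" add app.txt
git -C "$source_repo" commit --quiet -m "Initial commit"

# Make a fresh clone from the local path, forcing the regular git
# transport rather than the local copy/hard-link optimisation
fresh_clone="$(mktemp -d)/fresh"
git clone --quiet --no-local "$source_repo" "$fresh_clone"

# Both repositories should point at the same commit
source_head="$(git -C "$source_repo" rev-parse HEAD)"
clone_head="$(git -C "$fresh_clone" rev-parse HEAD)"
echo "source: $source_head"
echo "clone:  $clone_head"
```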

This seemed very sensible to me, so I did as it asked, created a fresh clone, and tried again:

> git filter-repo --to-subdirectory-filter engine/

Parsed 24 commits
New history written in 2.37 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 547b073 Use alternate robots.txt
Enumerating objects: 375, done.
Counting objects: 100% (375/375), done.
Delta compression using up to 4 threads
Compressing objects: 100% (161/161), done.
Writing objects: 100% (375/375), done.
Total 375 (delta 189), reused 327 (delta 189), pack-reused 0
Completely finished after 6.32 seconds.

That's much better! As you can see from the logs, git-filter-repo was very busy, rewriting the commits. Taking a look at the results afterwards, everything except the .git folder had been moved to the engine subfolder:

All files have been moved to the engine subfolder

and the history (shown with gitk here) shows that the original commits were all to the engine folder.

gitk shows the files were always committed to the engine folder
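If you'd rather check the result from the command line than from gitk, one option is to list every path that appears in any commit and look for anything outside engine/. The following is a minimal sketch against a throwaway repository that already has the rewritten layout; the file names are invented for the example, and in a real repository you would run only the last few commands.

```shell
# Build a throwaway repository whose history only ever touched engine/
check_repo="$(mktemp -d)"
git init --quiet "$check_repo"
git -C "$check_repo" config user.name "Demo User"
git -C "$check_repo" config user.email "demo@example.com"

mkdir -p "$check_repo/engine/lib"
printf 'hello\n' > "$check_repo/engine/app.txt"
git -C "$check_repo" add engine/app.txt
git -C "$check_repo" commit --quiet -m "Add app"

printf 'util\n' > "$check_repo/engine/lib/util.txt"
git -C "$check_repo" add engine/lib/util.txt
git -C "$check_repo" commit --quiet -m "Add util"

# Collect every path touched by any (non-merge) commit on any ref.
# --format= suppresses the commit header, sed drops the blank lines.
all_paths="$(git -C "$check_repo" log --all --format= --name-only | sed '/^$/d' | sort -u)"

# Count the paths that do NOT live under engine/
outside_count="$(printf '%s\n' "$all_paths" | grep -v '^engine/' | wc -l | tr -d ' ')"

printf '%s\n' "$all_paths"
echo "paths outside engine/: $outside_count"
```

By default git log doesn't list files for merge commits, so this is a quick sanity check rather than an exhaustive audit.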

This is almost exactly what I want, except I wanted the .gitignore and .gitattributes to remain at the top level.

I'll come back to those strange replace/* tags in the gitk image shortly.

The easiest way to fix the .gitignore location was more rewriting! I ran the following command to move the .gitignore and .gitattributes files back up to the root folder:

> git filter-repo \
  --path-rename engine/.gitattributes:.gitattributes \
  --path-rename engine/.gitignore:.gitignore

Parsed 24 commits
New history written in 1.35 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at f554e31 Use alternate robots.txt
fatal: replace depth too high for object 8027f9f8670e3da4762099d39e733bcfa44fea39
fatal: failed to run pack-refs
Completely finished after 2.45 seconds.

That appeared to work, as I now had the folder structure I wanted. But there were two slightly worrying fatal error messages in the logs 🤔 On top of that, when I tried opening gitk I got the following error message:

Error reading commits: fatal: replace depth too high for object 8027f9f8670e3da4762099d39e733bcfa44fea39

That's a bit concerning 😟 Luckily, after a bit of Googling, I found I could fix the issue by running:

> git replace -d 8027f9f8670e3da4762099d39e733bcfa44fea39
Deleted replace ref '8027f9f8670e3da4762099d39e733bcfa44fea39'

After that, I could successfully open gitk, and could see that the .gitignore and .gitattributes files were again in the root, with everything else in the engine folder:

gitk shows the gitignore files in the root folder, with everything else in the engine subfolder

So with that, my work was pretty much done. But that fatal error was bugging me, as were all those extraneous replace/ refs.

It took me a little while to work out what those refs even were, but eventually I pinned it down to a git feature called git-replace. That feature is worth a whole blog post on its own, so for now I'll just point you to the docs if you're interested, and I'll walk through the feature in a subsequent post.
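In the meantime, here is a minimal sketch of how replace refs can be created, listed, and removed by hand. It uses a throwaway repository with made-up commits rather than anything git-filter-repo produced, so it only illustrates the mechanism: a ref under refs/replace/ named after one object tells git to read a different object in its place.

```shell
# Build a throwaway repository with two commits
replace_repo="$(mktemp -d)"
git init --quiet "$replace_repo"
git -C "$replace_repo" config user.name "Demo User"
git -C "$replace_repo" config user.email "demo@example.com"

printf 'one\n' > "$replace_repo/file.txt"
git -C "$replace_repo" add file.txt
git -C "$replace_repo" commit --quiet -m "First commit"
first_commit="$(git -C "$replace_repo" rev-parse HEAD)"

printf 'two\n' >> "$replace_repo/file.txt"
git -C "$replace_repo" commit --quiet -am "Second commit"
second_commit="$(git -C "$replace_repo" rev-parse HEAD)"

# Create a replace ref: whenever git reads the second commit,
# it should use the first commit instead
git -C "$replace_repo" replace "$second_commit" "$first_commit"

# List the replaced objects, and show the underlying refs
replaced_before="$(git -C "$replace_repo" replace -l)"
git -C "$replace_repo" for-each-ref refs/replace/

# Delete the replace ref again, as in the fix above
git -C "$replace_repo" replace -d "$second_commit"
replaced_after="$(git -C "$replace_repo" replace -l)"
echo "replace refs remaining: ${replaced_after:-none}"
```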

I decided to start again, and this time I told git-filter-repo I didn't need the extra replace/ references by passing --replace-refs delete-no-add:

# Move everything to the engine/ subfolder
git filter-repo --replace-refs delete-no-add --to-subdirectory-filter engine/
# Move .gitignore and .gitattributes back to the root
git filter-repo --replace-refs delete-no-add \
  --path-rename engine/.gitattributes:.gitattributes \
  --path-rename engine/.gitignore:.gitignore

This time there were no fatal errors in the logs, gitk opened without any errors, and all the replace/ references were gone. Success! With that I could exit the Docker container, double check everything was correct, and do a git push origin --force-with-lease of my newly rewritten repo!

All in all, I'm very impressed with git-filter-repo, and using it inside the Docker container is clean and painless, so I'd definitely recommend it!

Summary

In this post I described a scenario where I wanted to rewrite the history of a git repository to make it appear as though some files were originally created in a sub-folder instead of the root folder. I described how to run a python:3 Docker container, how to install git-filter-repo, and the commands required to move all the files except .gitattributes and .gitignore to an engine subfolder. To make it simpler, I've reproduced the main steps here:

  1. Create a fresh clone of your repository, and cd to the clone directory
# Clone my/repo to output_directory
git clone https://github.com/my/repo output_directory
cd output_directory
  2. Run a python:3 Docker container interactively, and install git-filter-repo inside it
# run the Docker container
docker run --rm -it -v ${PWD}:/app -w /app python:3 /bin/bash

# inside the container, install git-filter-repo
# Add the backports repo to sources.list
echo 'deb http://deb.debian.org/debian bullseye-backports main' > /etc/apt/sources.list.d/backports.list

# Update the list of available packages
apt-get update

# Install git-filter-repo, adding the required /bullseye-backports suffix
apt-get install -y git-filter-repo/bullseye-backports
  3. Run the git-filter-repo commands to move all the files to the engine subdirectory, and then move the .gitignore and .gitattributes files back. Don't create replace/ refs.
# Move everything to the engine/ subfolder
git filter-repo --replace-refs delete-no-add --to-subdirectory-filter engine/

# Move .gitignore and .gitattributes back to the root
git filter-repo --replace-refs delete-no-add \
  --path-rename engine/.gitattributes:.gitattributes \
  --path-rename engine/.gitignore:.gitignore
