Timmmm

Reputation: 97008

Rewrite git history to modify a file

To remove a large unwanted file from all of git history you can use filter-branch to rewrite the index (the list of files in the repo) of each commit so the file was never added.

git filter-branch --index-filter "git rm --cached --ignore-unmatch path/to/offending_file.wav" --tag-name-filter cat -- --all

But what if I want to keep the file and just make it much smaller (for example, an icon that was accidentally stored as a huge image)? I tried this approach:

First add a replacement file to git's database

HASH=$(git hash-object -w /tmp/replacement.png)

Also note the file we want to replace

FILE="path/to/icon.png"

Now filter the index as follows: first check the file exists at this commit:

git cat-file -e :"$FILE"

If so remove it from the index:

git rm --cached "$FILE"

And finally add a reference to our replacement with the same filename.

git update-index --add --cacheinfo "100644,$HASH,$FILE"

Putting it all together:

git filter-branch --index-filter "if git cat-file -e ':$FILE' ; then git rm --cached '$FILE' ; git update-index --add --cacheinfo '100644,$HASH,$FILE' ; fi" --tag-name-filter cat -- --all
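To check that the rewrite did what was intended, it can help to list every distinct blob the path has pointed to across the rewritten branches and tags. This is only a sketch: path_blob_history is a made-up helper name, and it deliberately looks at branches and tags only, so the refs/original backups are ignored. After a successful rewrite I would expect it to print just the replacement hash.

```shell
# Hypothetical helper: print each distinct blob hash that PATH has had
# in the history reachable from branches and tags.
path_blob_history() {
    # Commits that touch the path, on branches and tags only.
    git rev-list --branches --tags -- "$1" |
        while read -r commit; do
            # Prints nothing if the path is absent at this commit.
            git rev-parse --quiet --verify "$commit:$1"
        done | sort -u
}
```

Usage would be something like path_blob_history "$FILE", comparing the output against $HASH.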

This seems to work and doesn't print any errors that are too scary. However, no matter how many git gc and prune commands I try the original blob still exists in the repository. Even if I clone the repo to a new place it still exists.

I suspect this is because the remote refs, and the original refs that filter-branch creates under refs/original, still point to the old commits, so the original file is still referenced.

I did try removing them all with a hack like this:

for REF in `git show-ref | cut -c 42- | grep original` ; do git update-ref -d $REF ; done

And the same for remotes, but the blob is still there.

So my questions:

  1. Is there a way to see why a blob isn't garbage collected? I.e. which parent objects in the graph point to it?
  2. Is there a non-hacky way to remove the original refs (and maybe the remotes), including all branches and tags?
  3. Is there anything else I'm missing?

Answers (1)

Timmmm

Reputation: 97008

Aha I've done it! I think.

Here are the extra steps. First, it's a good idea to note the hash of the blob you want to get rid of at the start, so you can check whether it still exists with

git cat-file -t 949abcd....
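For question 1 (what is keeping the blob alive), something along these lines should narrow it down. It is a sketch rather than a tested recipe: the function names are made up, and --find-object needs a reasonably recent git (2.16 or later, as far as I know). The idea is to walk everything reachable from each ref and see whether the blob hash shows up.

```shell
# Hypothetical helpers; pass the full blob hash as the first argument.

# Succeeds if any ref or reflog entry can still reach the blob.
blob_reachable() {
    git rev-list --all --reflog --objects | grep -q "^$1"
}

# Prints every ref (branches, tags, remotes, refs/original, ...) whose
# history still contains the blob.
refs_holding_blob() {
    git for-each-ref --format='%(refname)' |
        while read -r ref; do
            if git rev-list --objects "$ref" | grep -q "^$1"; then
                echo "$ref"
            fi
        done
}

# Lists the commits that add or remove the blob.
blob_commits() {
    git log --all --reflog --oneline --find-object="$1"
}
```

If refs_holding_blob prints nothing but blob_reachable still succeeds, the reflog is the likely culprit.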

OK, so first I cleared the reflog, since it still references the original (pre-rewrite) commits:

git reflog expire --expire=now --all

Next I removed the origin remote, since it still has a reference to the original tree. I guess if you push the new hashes (probably need to force push) then this step will be unnecessary and the file should be eventually GCed anyway.

git remote rm origin

Next I removed the original refs (that filter-branch creates). I didn't find a less hacky way:

for REF in `git show-ref | cut -c 42- | grep original` ; do git update-ref -d $REF ; done
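A slightly less hacky variant is to ask git for the refs under refs/original/ directly instead of cutting and grepping the show-ref output. The git filter-branch documentation suggests much the same thing using for-each-ref piped into update-ref. The function name below is just for illustration.

```shell
# Hypothetical helper: delete the backup refs that filter-branch
# leaves under refs/original/.
delete_original_refs() {
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done
}
```

Because the pattern is anchored to refs/original/, this won't accidentally match a branch or tag that merely has "original" in its name, which the grep version would.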

Finally, garbage collect. I'm not sure whether --aggressive is required, but --prune=now definitely is: by default git gc only prunes unreachable objects older than two weeks, for safety.

git gc --aggressive --prune=now

After all these steps git cat-file reports that the blob is gone! I haven't experimented with pushing the result back to origin (after you re-add it), and I'm not 100% sure which of the above steps are necessary, but this seemed to work so far.
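An alternative to cleaning up in place, which the git filter-branch documentation also mentions, is to clone the rewritten repository through a file:// URL. A plain path clone copies or hardlinks the whole object store, which would explain why the blob survived my earlier clone, whereas file:// goes through the normal transfer machinery and only sends objects reachable from the branches and tags. A sketch, with a made-up function name and placeholder paths:

```shell
# Hypothetical helper: clone SRC into DST via file:// so that only
# objects reachable from branches and tags are transferred.
clean_clone() {
    # Resolve SRC to an absolute path, since file:// URLs need one.
    src=$(cd "$1" && pwd) || return 1
    git clone --quiet "file://$src" "$2"
}
```

Usage would be something like clean_clone path/to/rewritten_repo path/to/fresh_repo, then running git cat-file -t on the old blob hash inside the new clone to confirm it is missing.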
