R.M.

Reputation: 3643

How to use git filter-branch to remove a file by blob SHA1?

Most of the git filter-branch examples I've seen remove files based on filename. I don't necessarily want to do that. Instead, I've identified a number of blob (not commit) SHA1s of the files I want to remove, regardless of where they are in the repository. (Due to our repo history, files tend to move around a lot without changing.)

What's the best way to tell git filter-branch to remove files based on their blob SHA1?

Upvotes: 4

Views: 3485

Answers (4)

Jean Paul

Reputation: 1588

git filter-branch --index-filter loads each commit into the index in turn, so the filename can be recovered from the blob hash with git ls-files -s.

I do this to remove blobs with hashes 2d341f0223ff, 6a4558fa76d1 and 4d0a90cba061:

git filter-branch --force --index-filter "git ls-files -cdmo -s | grep ' 2d341f0223ff\| 6a4558fa76d1\| 4d0a90cba061' | awk '{print \$4}' | xargs git rm --cached --ignore-unmatch 656565randomstring546464" --prune-empty --tag-name-filter cat -- --all

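The random string is a dummy path: it keeps git rm from failing when grep finds no match and xargs would otherwise run it with no path arguments. Note that the $ in the awk program is escaped (\$4) because the whole filter is inside double quotes.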
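When the list of blobs grows, a grep alternation becomes awkward to maintain. As a rough sketch (not part of the answer above), the ids could be kept in a file and matched with awk against input in git ls-files -s format. The function name paths_for_ids, the ids file, and the sample ids and paths are all hypothetical, and the sample index lines are simulated so the snippet runs without a repository.

```shell
# Print the paths of entries whose blob id is listed in the ids file.
# Input on stdin: lines in `git ls-files -s` format: mode SP id SP stage TAB path.
# $1: file with one full 40-character blob id per line (hypothetical).
paths_for_ids() {
  awk -F '\t' 'NR == FNR { ids[$1] = 1; next }
               { split($1, meta, " "); if (meta[2] in ids) print $2 }' "$1" -
}

# Made-up 40-character ids for illustration only.
wanted=$(printf '2d341f02%032d' 0)
kept=$(printf '6a4558fa%032d' 0)

ids_file=$(mktemp)
printf '%s\n' "$wanted" > "$ids_file"

# Simulated `git ls-files -s` output; in a real filter this would be the command itself.
matches=$(printf '100644 %s 0\t%s\n' "$wanted" 'data/big file.bin' "$kept" 'README.md' |
  paths_for_ids "$ids_file")

printf '%s\n' "$matches"
rm -f "$ids_file"
```

Splitting on the tab keeps paths containing spaces intact, which printing the fourth whitespace-separated field does not.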

Upvotes: 1

Mark Adelsberger

Reputation: 45819

As noted by @RobertTyley in his answer, you're probably better off using BFG. However, to answer the question as asked (how to do this with filter-branch):

There isn't a great way, unfortunately. You can write a script that finds all filenames associated with the SHA value in the index. As a starting point, if you're removing a file with hash DEADC0DE:

git rev-list -n 1 --objects HEAD |grep ^DEADC0DE |cut -c 42-

You'd then feed each line (perhaps with xargs?) as the <filename> in

git rm --cached <filename>

And you'd use that script as your index-filter value (because using it as a tree filter will just make an already slow approach even slower).
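To see why the pipeline cuts from column 42, here is a small sketch run against simulated git rev-list --objects output (each object line is a 40-character id, a space, then the path). The ids and paths are invented for illustration, and the listing is built with printf so no repository is needed.

```shell
# Made-up 40-character ids: an 8-character prefix plus zero padding.
target=$(printf 'deadc0de%032d' 1)
other=$(printf 'cafebabe%032d' 2)

# Simulated `git rev-list --objects` output: "<id> <path>" per object.
listing=$(printf '%s %s\n' \
  "$target" 'assets/old logo.png' \
  "$other"  'README.md' \
  "$target" 'backup/old logo.png')

# Keep lines whose id starts with the prefix, then drop the first 41
# characters (40-character id plus one space), leaving only the path.
paths=$(printf '%s\n' "$listing" | grep '^deadc0de' | cut -c 42-)

printf '%s\n' "$paths"
```

Inside an index-filter, each resulting line could be passed to git rm --cached, reading line by line rather than splitting on whitespace so that paths with spaces survive. Note also that filter-branch exposes the commit being rewritten as $GIT_COMMIT, which is probably what the listing should be taken from there rather than HEAD.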

Upvotes: 0

Josh Lee

Reputation: 177825

The filter-branch version could look something like this inside of index-filter:

git ls-files -s |
  sed -r '/ 02c97746d64fbfe13007a1ab4e9b9e4bbd99f42f /s/^100(644|755)/0/' |
  git update-index --index-info

That is, read the index-info format, find the interesting blob and set the mode to 0 (marking it for removal), then write that back to the index.
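As a quick illustration of that rewrite, the sed expression can be tried on a single simulated index entry without touching a repository. The path docs/secret.txt is made up, and -E is used here in place of -r only because it is accepted by both GNU and BSD sed.

```shell
# One simulated entry in `git ls-files -s` format: mode SP id SP stage TAB path.
entry=$(printf '100644 02c97746d64fbfe13007a1ab4e9b9e4bbd99f42f 0\tdocs/secret.txt')

# On lines containing the target id, replace the file mode with 0.
# `git update-index --index-info` treats mode 0 as "remove this path".
rewritten=$(printf '%s\n' "$entry" |
  sed -E '/ 02c97746d64fbfe13007a1ab4e9b9e4bbd99f42f /s/^100(644|755)/0/')

printf '%s\n' "$rewritten"
```

Entries with any other blob id pass through unchanged, so feeding the whole stream back to update-index leaves the rest of the index as it was.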

Upvotes: 0

Roberto Tyley

Reputation: 25314

Your task is to remove blobs from Git history by a hash identifier. You may find it faster and easier to use the BFG rather than git-filter-branch, specifically using the --strip-blobs-with-ids flag:

-bi, --strip-blobs-with-ids <blob-ids-file> ...strip blobs with the specified Git object ids

Carefully follow the usage instructions; the core part is just this:

$ java -jar bfg.jar  --strip-blobs-with-ids <blob-ids-file>  my-repo.git

Note that the <blob-ids-file> file should contain Git object ids, rather than plain SHA-1 hashes of the blobs' contents.

For a given file, you can calculate the Git object id with git hash-object:

$ git hash-object README.md
a63b49c2e93788cd71c81015818307c7b70963bf

You can see that this value is different to a simple SHA-1 hash:

$ sha1sum README.md
7b833f7b37550e2df719b57e8c4994c93a865aa9  README.md

...that's because the Git object id hashes a Git header, along with the contents of the file, even though it does use the same SHA-1 algorithm.
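The header in question is the string "blob", a space, the content size in bytes, and a NUL byte. A minimal sketch of reproducing a blob id with sha1sum alone follows; the helper name git_blob_id is hypothetical, and the temporary file just holds the six bytes "hello" plus a newline.

```shell
# Compute a Git blob id without Git: SHA-1 over "blob <size>\0" followed by the content.
git_blob_id() {
  file=$1
  size=$(wc -c < "$file" | tr -d ' ')
  { printf 'blob %s\0' "$size"; cat "$file"; } | sha1sum | cut -d ' ' -f 1
}

tmp=$(mktemp)
printf 'hello\n' > "$tmp"

blob_id=$(git_blob_id "$tmp")
printf '%s\n' "$blob_id"   # ce013625030ba8dba906f756967f9e9ca394464a

rm -f "$tmp"
```

This is the same value git hash-object reports for that content, so ids computed this way should be usable in the blob ids file.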

The BFG is typically at least 10-50x faster than running git-filter-branch, and generally easier to use.

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Upvotes: 8
