user3586586

Reputation: 51

Remove unused large files from Git within a range

My repo is forked from an open-source project, so I don't want to modify the commits before the ForkPoint tag. I've tried the BFG Repo Cleaner, but it doesn't let me specify a range.

I want to

  1. Go through the history in ForkPoint..HEAD^
  2. Rewrite the commits to delete all files larger than 10M

How to remove unused objects from a git repository? says it should be something like this:

BADFILES=$(find . -type f -size +10M -exec echo -n "'{}' " \;)
git filter-branch --index-filter \
"git rm -rf --cached --ignore-unmatch $BADFILES" ForkPoint..HEAD^

but wouldn't BADFILES only contain the files that exist in the current working tree?

For instance, if I mistakenly committed a HUGE_FILE and then made another commit that removes it, the BADFILES search wouldn't find the HUGE_FILE, since find never sees it in the working tree.
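
For what it's worth, I think something like this (untested) would list every blob over 10M anywhere in the range, including ones later deleted, which is exactly what find can't do:

git rev-list --objects ForkPoint..HEAD^ |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
awk '$1 == "blob" && $3 > 10485760 {print $3, $4}'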


Edit1: Now I'm considering using BFG on a clone, then moving my fork onto the original ForkPoint. Would this be the right command, given fatRepo and slimRepo?

mkdir merger ; cd merger ; git init
git remote add fat  ../fatRepo
git remote add slim ../slimRepo
git fetch --all
git checkout fat/ForkPoint
git cherry-pick slim/ForkPoint..slim/branchHead

Edit2: Cherry-picking didn't work, because cherry-pick can't handle the merges in slimRepo. Can I somehow squash the history of slimRepo down into a single commit and simply merge that onto fatRepo/ForkPoint?

git <turn into a single commit> slim/rootNode..slim/ForkPoint
git checkout fat/ForkPoint
git merge slim/branchHead
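
Perhaps a soft reset could do the squashing? Something like this (an untested guess):

git checkout slim/branchHead
git reset --soft fat/ForkPoint   # keep slim's files, but point HEAD at fat's history
git commit -m "slim history squashed into a single commit"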

Upvotes: 2

Views: 907

Answers (1)

torek

Reputation: 488463

Yes, you are correct.

If you can identify the files in advance, just list them manually.
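
For instance, to strip two known files from every commit in the range (HUGE_FILE is from your question; the second path is just a hypothetical placeholder):

git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch HUGE_FILE some/dir/bigfile.bin' \
    ForkPoint..HEAD^

Here --ignore-unmatch is needed, since the listed files may not exist in every commit in the range.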

If you need to pick large files from each commit, you can:

  • use the index-filter (as shown in your example above) but check for large files in $GIT_COMMIT, or
  • use a tree-filter and simply remove large files

(or of course anything else you can come up with).

The index-filter is much faster, as it allows you (and Git) to skip the messy business of turning each to-be-filtered commit into a work-tree and back again. If there are only a few commits to copy, however, you will be putting time and mental effort into something with a small overall return. If you do go this way, note that you need enough quoting that $GIT_COMMIT is expanded only when filter-branch evals your filter string; filter-branch puts GIT_COMMIT into the environment for each commit it copies (see, e.g., the script trick below).
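
For example, wrapping the filter in single quotes keeps your interactive shell from expanding $GIT_COMMIT too early, so it is still intact when filter-branch evals the string with GIT_COMMIT in the environment. A harmless sketch that only prints each commit as it is filtered:

git filter-branch --index-filter 'echo "filtering $GIT_COMMIT" >&2' ForkPoint..HEAD^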

The tree-filter is easy to use: in this case, git extracts the original commit into a clean, empty sub-directory (by default, a sub-directory created within the .git directory containing the repository, but see the -d argument) and runs your filter (in that sub-directory). Whatever files remain afterward are put into a new commit with the other filters, if any, also applied (in the order given in the documentation). So your tree-filter could simply be:

find . -type f -size +10M -exec rm '{}' ';'
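
Putting that together, the whole invocation might look like this (a sketch; note the inner quotes around {} and ;, which have to survive the eval):

git filter-branch --tree-filter \
    "find . -type f -size +10M -exec rm '{}' ';'" \
    ForkPoint..HEAD^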

Note that the string is passed to eval, so it needs enough levels of quoting to survive that step (as in the inline sketch above). Alternatively, you can simply run it by a full path name: put your script in a file such as /tmp/cleanup.sh, make it executable, and use:

git filter-branch --tree-filter /tmp/cleanup.sh ForkPoint..HEAD^

The tree-filter will be slow, but you might not care that much, especially if your range contains only a handful of commits.


Edit: to find large files in a particular commit (or other tree) by looking at the tree stored in that commit, which is what you would need in an index filter, you can use this script-ette (lightly tested):

git ls-tree -lr $ref |
while read mode type hash size path; do
    [ "$type" = blob ] || continue   # skip submodule entries, whose size is "-"
    [ "$size" -gt "$limit" ] && echo "$size $path"
done

Choose suitable values for $ref ($GIT_COMMIT in an index filter) and $limit. Change the echo command to git rm --cached -- "$path" to remove the files in the filter. (You won't need --ignore-unmatch, since the paths come from the tree of that very commit.)

You can see what this would do by using git rev-list to prepare a set of refs first:

git rev-list ForkPoint..HEAD^ | /tmp/script

where /tmp/script is:

#!/bin/sh
# print the size and path of each over-limit file in every revision read from stdin

check_tree() {
    git ls-tree -lr $1 |
    while read mode type hash size path; do
        [ "$type" = blob ] || continue   # skip submodule entries, whose size is "-"
        [ "$size" -gt "$limit" ] && echo "$size $path"
    done
}

limit=1000000 # or whatever number

while read rev; do
    check_tree $rev
done

Then use a slightly modified script (as noted above) as the actual index filter, once you have found the desired size-limit value.
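
Putting the pieces together, the actual index filter might look something like this (a sketch along the lines above, assuming a 10M limit; the if form keeps the loop's exit status zero, so a final "file too small" test doesn't make filter-branch think the filter failed):

git filter-branch --index-filter '
    limit=10485760  # 10M, or whatever number
    git ls-tree -lr $GIT_COMMIT |
    while read mode type hash size path; do
        if [ "$type" = blob ] && [ "$size" -gt "$limit" ]; then
            git rm --cached --quiet -- "$path"
        fi
    done
' ForkPoint..HEAD^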

Upvotes: 1
