Reputation: 524
In our git repo, one of the branches contains binary files that were committed and pushed to the remote repo for testing; however, this has had the unintended consequence of bloating the repo. After doing some research here and here and then some, I found a number of scenarios for which the solutions vary greatly. I am wondering if we have a simpler scenario that avoids "git push --all --force" (which requires greater coordination) that we may be able to take advantage of.
In our case, we don't care that the branch exists anymore and are perfectly fine with it getting removed (along with its history etc). We can take the work involved and recommit it in another branch. Since the branch hasn't been merged into master, is it possible for us to delete the branch entirely? Assuming that the references to the committed binaries are self-contained in the branch, is there a simpler solution?
From the research, the following solutions have been called out:
However, they assume that the reader wants to retain the history, and so they remove the offending binaries and rewrite the history, and/or they assume that the issue is still confined to the local repository. If the issue is remote, a fix to the local repository is required, followed by a push --all to the remote.
In this case, we have already deleted the branch and recommitted the work on a fresh branch, but the size hasn't changed yet. What else do we need to do? Is there an easier solution, since the data is confined to the deleted branch and the branch is allowed to be deleted? We also aren't sure if git will keep the binaries in some way in order to keep references to them in other parts of the history. Is garbage collection required at the remote server? Pruning of references?
Upvotes: 0
Views: 3345
Reputation: 489293
Deleting the branch is, in general, the right answer. But there are a lot of fiddly little knobs to turn here. Some of them you can avoid dealing with by just waiting (about a month). If you don't want to wait for the various copies of the repository to shrink on their own, though:
In this case, we have already deleted the branch and recommitted the work on a fresh branch, but the size hasn't changed yet ...
First, remember that Git is distributed by nature. Each repository is (at least in principle) wholly self-contained and independent of every other repository. So when you say that the repository has not yet slimmed-down, the obvious first question is: which one?
Any change you make to any one repository won't affect any other repository, at least, not until you cross-connect the two of them and tell one to fetch new work from the other, or push new work to the other. If you are doing all of this in a test clone, that's fine, just remember that the test clone's results will be specific to that one clone.
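To answer "which one?" concretely, git count-objects reports the on-disk object storage of the one repository you run it in. Here is a minimal sketch in a throwaway repository; the scratch directory, identity and file name are made up for illustration:

```shell
#!/bin/sh
set -e
# Throwaway repository; the path, identity and file are purely illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"
echo hello > file.txt
git add file.txt
git commit -q -m "initial"

# -v prints loose-object and pack statistics; -H makes the sizes human-readable.
# The numbers describe this repository only, not the remote or any other clone.
stats=$(git count-objects -vH)
echo "$stats"
```

Run the same command in each copy you care about (your clone, a colleague's clone, the server's bare repository if you have shell access) and compare the size and size-pack lines before and after each cleanup step.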
The immediate next problem is that Git, by its nature, ‘wants’ to make more copies of everything. Commits are like some viruses or diseases: connect one Git to another Git, and the Git that didn't have the commits has them now. The Git that did have the commits, still has them. When you do finally remove the commits from (say) sixteen clones, it will be absurdly easy for anyone, anywhere, who does have the commits in their clones to accidentally reintroduce them to the fixed-up clones, from which they'll spread back to all the others. That doesn't mean you can't be rid of the commits—and the "only reachable from one branch" nature of the way you have them now is going to simplify things quite a lot, as you merely need to make sure that no one else restores or merges that branch from their clone.
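One way the branch lingers in other clones is as a remote-tracking name. The sketch below is an illustration under assumed names (a scratch bare repository called origin.git, two clones called alice and bob, and a branch called big-binaries, none of which come from the question). It shows that deleting the branch on the remote does not by itself remove origin/big-binaries from a second clone; fetching with --prune does:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare origin.git

# First clone: creates and pushes the hypothetical test branch.
git clone -q origin.git alice 2>/dev/null
cd alice
git config user.name "Alice"
git config user.email "alice@example.invalid"
echo base > base.txt
git add base.txt
git commit -q -m "base"
base=$(git symbolic-ref --short HEAD)
git push -q origin "$base"
git checkout -q -b big-binaries
echo data > blob.bin
git add blob.bin
git commit -q -m "add test binary"
git push -q origin big-binaries
cd ..

# Second clone: picks the branch up as origin/big-binaries.
git clone -q origin.git bob

# The branch is deleted on the remote from the first clone.
git -C alice push -q origin --delete big-binaries

# The second clone still has its remote-tracking name until it prunes.
before=$(git -C bob branch -r | grep -c 'origin/big-binaries' || true)
git -C bob fetch -q --prune
after=$(git -C bob branch -r | grep -c 'origin/big-binaries' || true)
echo "origin/big-binaries in second clone: before=$before after=$after"
```

Pruning only removes the name. Anyone who also has a local branch with those commits can still push it back, so the coordination the answer describes is still needed.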
For lots of useful background, I recommend reading and working through the web site Think Like (a) Git. Once you have digested what's there, the way to shrink your repository is:
Make sure the commit(s) that have the large file(s) are unreachable. In your specific case, deleting the branch name gets you most of the way there: they were reachable from that branch name, and through that branch's reflogs. Deleting the branch removes its reflogs as well, so that path is now cleared out.
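To check where things stand after deleting the branch name, you can ask Git which refs still contain the commit and whether the HEAD reflog still mentions it. This is a sketch in a scratch repository; the branch name big-binaries and the file names are hypothetical:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"
echo base > base.txt
git add base.txt
git commit -q -m "base"
base=$(git symbolic-ref --short HEAD)

# Hypothetical test branch carrying a binary file.
git checkout -q -b big-binaries
head -c 100000 /dev/urandom > blob.bin
git add blob.bin
git commit -q -m "add test binary"
bad=$(git rev-parse HEAD)

# Delete the branch name (-D because it was never merged).
git checkout -q "$base"
git branch -q -D big-binaries

# No branch, tag or other ref reaches the commit any more...
refs=$(git for-each-ref --contains "$bad")
echo "refs containing the commit: ${refs:-none}"

# ...but the HEAD reflog still records it, so it is not yet collectable.
in_reflog=$(git log -g --format=%H HEAD | grep -c "$bad" || true)
echo "HEAD reflog entries pointing at it: $in_reflog"
```

If the branch also exists on a remote, git push origin --delete followed by the branch name removes it there; the remote repository then needs its own cleanup, independent of yours.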
The place from which those commits can (probably) still be reached is in your HEAD reflog. Running git reflog will show you all the HEAD reflog entries (the default action is show and the default reflog to show is that for HEAD). You could selectively expunge each such reflog entry, with, e.g., git reflog delete, but it's easier to just delete all your HEAD reflog entries with:
git reflog expire --expire=now --expire-unreachable=now HEAD
Note that this removes all your ability to restore otherwise accidentally lost HEAD commits, so be very sure you are OK with this before you do it. You can leave out --expire=now since the deleted-branch-specific commits should not be reachable from your current branch; I'm showing the "nuke it from orbit" variant of the command here.
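The gentler variant, without --expire=now, can be tried safely in a scratch repository first. The sketch below names the HEAD reflog explicitly and counts how many of its entries point at the deleted branch's commit before and after expiry; the repository, branch and file names are made up for illustration:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"
echo base > base.txt
git add base.txt
git commit -q -m "base"
base=$(git symbolic-ref --short HEAD)

# Hypothetical test branch, committed to and then deleted.
git checkout -q -b big-binaries
echo data > blob.bin
git add blob.bin
git commit -q -m "add test binary"
bad=$(git rev-parse HEAD)
git checkout -q "$base"
git branch -q -D big-binaries

before=$(git log -g --format=%H HEAD | grep -c "$bad" || true)

# Drop only the entries that refer to commits no longer reachable;
# entries for commits you can still reach are left alone.
git reflog expire --expire-unreachable=now HEAD

after=$(git log -g --format=%H HEAD | grep -c "$bad" || true)
echo "HEAD reflog entries naming the deleted commit: before=$before after=$after"
```

Adding --expire=now on top of this empties the HEAD reflog entirely, which is the trade-off described above.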
Then, run git gc --prune=now. This is the last step of the "checklist for shrinking a repository" from the git filter-branch documentation.
This will take care of all of the various items needed to rebuild pack files and/or discard loose objects that hold the large files that are no longer reachable from any external name. That is, no external name points directly or indirectly to any commit that, through its tree or one of its tree's subtrees, points to the blob object holding the file. Thus, the gc command will orchestrate the other commands (git repack and git prune) that will delete the unwanted objects.
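Putting the steps together, here is a sketch that runs the whole sequence in a scratch repository and checks both that the .git directory shrank and that the commit object is really gone. The branch name, file names and the 1 MB of random data are assumptions chosen only to make the effect visible:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"
echo base > base.txt
git add base.txt
git commit -q -m "base"
base=$(git symbolic-ref --short HEAD)

# Hypothetical test branch with roughly 1 MB of incompressible data.
git checkout -q -b big-binaries
head -c 1000000 /dev/urandom > blob.bin
git add blob.bin
git commit -q -m "add test binary"
bad=$(git rev-parse HEAD)

# Step 1: make the commit unreachable (branch name, then HEAD reflog).
git checkout -q "$base"
git branch -q -D big-binaries
git reflog expire --expire-unreachable=now HEAD

# Step 2: repack and prune, measuring the .git directory around it.
size_before=$(du -sk .git | cut -f1)
git gc -q --prune=now
size_after=$(du -sk .git | cut -f1)

# cat-file -e exits non-zero when the object no longer exists.
if git cat-file -e "$bad" 2>/dev/null; then gone=no; else gone=yes; fi
echo ".git size in KiB: before=$size_before after=$size_after; commit gone: $gone"
```

This only demonstrates the local half. A hosted remote runs its own garbage collection on its own schedule, and whether you can trigger it depends on the hosting service.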
(Note: If you are using .keep files to retain old packs, you will have to remove those .keep files and allow those packs to be destroyed. If you're doing this, though, you probably are not asking this question in the first place.)
Upvotes: 2