Reputation: 9255
I have a GitHub repository that had two branches - master and release.
The release branch contained binary distribution files that were contributing to a very large repository size (more than 250 MB), so I decided to clean things up.
First I deleted the remote release branch, via git push origin :release.
Then I deleted the local release branch. First I tried git branch -d release, but Git said "error: The branch 'release' is not an ancestor of your current HEAD.", which is true, so then I did git branch -D release to force it to be deleted.
But my repository size, both locally and on GitHub, was still huge. So then I ran through the usual list of Git commands, like git gc --prune=today --aggressive, without any luck.
By following Charles Bailey's instructions at SO 1029969 I was able to get a list of SHA-1 hashes for the biggest blobs. I then used the script from SO 460331 to find the blobs...and the five biggest don't exist, though smaller blobs are found, so I know the script is working.
I think these blobs are the binaries from the release branch, and they somehow got left behind after that branch was deleted. What's the right way to get rid of them?
Upvotes: 171
Views: 142461
Reputation: 2235
Put this in your Git config:
[gc]
pruneexpire = now
reflogExpireUnreachable = now
Then run
offending_commit=XXXXXXXXXXXXXXXXXXXXXXXXX
git fsck --unreachable --no-reflogs | grep $offending_commit
If you see the commit, great: it is a target for extermination.
git reflog expire --expire=now --expire-unreachable=now --all
git gc --prune=now
But if the commit is not there, or it still appears after cleaning up, then it is not a dangling commit: there might be a branch or tag pointing to something, that points to something, (...), that points to the offending commit. To find out what keeps the offending commit alive even after cleanup, run this snippet:
offending_commit=XXXXXXXXXXXXXXXXXXXXXXXXX
ff=$offending_commit
while :
do
final_node=$ff
ff=$(git log --format='%H %P' --all | grep -F "$ff" | cut -f1 -d' ' | head -1)
if [[ -z $ff ]]; then
break
fi
done
echo $final_node
The previous snippet will find a leaf commit that points to a commit, that points to a commit, that points to a commit, (...), that points to the offending commit.
Remove whatever tag or branch points to that $final_node commit. Run the previous snippet again in case there is another tag/branch pointing to something that points to the offending commit, then try the cleanup steps again.
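To see which refs keep a commit alive, git branch --contains and git tag --contains can help. A minimal sketch in a throwaway repository (the tag name keeper-tag is made up for illustration):

```shell
# Demo in a throwaway repository; "keeper-tag" is a made-up name.
cd "$(mktemp -d)"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "offending commit"
offending_commit=$(git rev-parse HEAD)
git tag keeper-tag                      # this tag keeps the commit reachable
git branch --all --contains "$offending_commit"   # branches that hold it
git tag --contains "$offending_commit"            # prints: keeper-tag
```

Deleting keeper-tag would make the commit unreachable, after which the expire and gc steps can remove it.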
Restore your git config to whatever it was
Upvotes: 0
Reputation: 8359
I present to you this useful command, "git-gc-all", guaranteed to remove all your Git garbage, at least until they come up with extra configuration variables:
git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
-c gc.rerereunresolved=0 -c gc.pruneExpire=now gc
You might also need to run something like these first:
git remote rm origin
rm -rf .git/refs/original/ .git/refs/remotes/ .git/*_HEAD .git/logs/
git for-each-ref --format="%(refname)" refs/original/ |
xargs -n1 --no-run-if-empty git update-ref -d
You might also need to remove some tags:
git tag | xargs git tag -d
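Note that git tag -d only removes local tags. A hedged sketch for clearing them from the remote as well, assuming the remote is named origin (demonstrated here against a throwaway bare repository):

```shell
# Demo setup: a throwaway bare repository standing in for your real remote.
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"
git init -q "$tmp/work" && cd "$tmp/work"
git remote add origin "$tmp/origin.git"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "demo"
git tag v1 && git push -q origin --tags      # the remote now has tag v1

# The actual cleanup: ask the remote to delete every tag it still has.
git ls-remote --tags --refs origin |
    cut -f2 |
    xargs -n1 --no-run-if-empty git push -q origin --delete
git ls-remote --tags --refs origin           # prints nothing now
```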
Upvotes: 291
Reputation: 9138
You can (as detailed in this answer) permanently remove everything that is referenced only in the reflog.
WARNING: This will remove many objects you might want to keep:
Read the documentation to be sure this is what you want.
To expire the reflog, and then prune all objects not in branches:
git reflog expire --expire-unreachable=now --all
git gc --prune=now
git reflog expire --expire-unreachable=now --all removes all reflog references to unreachable commits.
git gc --prune=now removes the commits themselves.
Attention: Only using git gc --prune=now will not work, since those commits are still referenced in the reflog; therefore, clearing the reflog is mandatory. Also note that if you use rerere, it keeps additional references that are not cleared by these commands; see git help rerere for more details. In addition, any commits referenced by local or remote branches or tags will not be removed, because Git considers those valuable data.
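To check whether such a cleanup actually freed anything, git count-objects -v reports object counts and sizes; run it before and after (shown here in a fresh throwaway repository):

```shell
# Quick size check; in practice run this in your own repository.
cd "$(mktemp -d)" && git init -q
git count-objects -v      # "size" and "size-pack" are reported in KiB
```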
Upvotes: 133
Reputation: 1716
You can use git forget-blob.
The usage is pretty simple:
git forget-blob file-to-forget
You can get more information in Completely remove a file from a Git repository with 'git forget-blob'.
It will disappear from all the commits in your history, reflog, tags, and so on.
I run into the same problem every now and then, and every time I have to come back to this post and others. That's why I automated the process.
Credits go to contributors such as Sam Watkins.
Upvotes: 2
Reputation: 2277
To add another tip: don't forget to use git remote prune to delete the stale remote-tracking branches of your remotes before using git gc.
You can see them with git branch -a.
It's often useful when you fetch from GitHub and forked repositories...
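A sketch of that tip in a throwaway setup (the remote name origin and branch name release are illustrative):

```shell
# Demo: a bare "origin" whose release branch disappears behind our back.
tmp=$(mktemp -d)
git init -q --bare -b master "$tmp/origin.git"
git init -q -b master "$tmp/work" && cd "$tmp/work"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "demo"
git remote add origin "$tmp/origin.git"
git push -q origin master master:release      # publish two branches
git fetch -q origin                           # track them locally
git --git-dir="$tmp/origin.git" branch -q -D release   # deleted remotely

git branch -r                         # still lists origin/release (stale)
git remote prune --dry-run origin     # preview what would be removed
git remote prune origin               # delete the stale tracking branch
git branch -r                         # only origin/master remains
```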
Upvotes: 1
Reputation: 115
Before doing git filter-branch and git gc, you should review the tags that are present in your repository. Any real system with automatic tagging for things like continuous integration and deployments will keep unwanted objects referenced by those tags, so gc can't remove them, and you will keep wondering why the repository is still so big.
The best way to get rid of all the unwanted stuff is to run git filter-branch and git gc, and then push master to a new bare repository. The new bare repository will have the cleaned-up tree.
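A minimal sketch of that last step with throwaway paths (in real life the destination would be a fresh bare repository or an empty GitHub repo):

```shell
# Demo: a tiny "cleaned" repo pushed into a brand-new bare repository.
tmp=$(mktemp -d)
git init -q -b master "$tmp/old" && cd "$tmp/old"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "cleaned history"
git init -q --bare -b master "$tmp/clean.git"
git push -q "$tmp/clean.git" master          # only reachable objects travel
git --git-dir="$tmp/clean.git" log --oneline # shows the "cleaned history" commit
```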
Upvotes: 3
Reputation: 22526
Each time your HEAD moves, Git tracks this in the reflog. If you removed commits, you still have "dangling commits", because they are still referenced by the reflog for about 30 days. This is the safety net for when you delete commits by accident.
You can use the git reflog command to remove specific commits, repack, etc., or just use the high-level command:
git gc --prune=now
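For the lower-level route, a sketch in a throwaway repository (the entry index HEAD@{1} is illustrative):

```shell
# Demo: delete one reflog entry, then garbage-collect.
cd "$(mktemp -d)" && git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"
git reflog                      # two entries: HEAD@{0} and HEAD@{1}
git reflog delete HEAD@{1}      # drop the specific entry pinning old state
git gc --prune=now --quiet      # then repack and prune
```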
Upvotes: 15
Reputation: 1830
Try git filter-branch: it does not remove big blobs by itself, but it can remove the big files you specify from the whole repository, leaving the blobs unreferenced so gc can prune them. For me it reduced the repository size from hundreds of MB to 12 MB.
Upvotes: 1
Reputation: 1326782
As mentioned in this SO answer, git gc can actually increase the size of the repo!
See also this thread:
Now Git has a safety mechanism to not delete unreferenced objects right away when running 'git gc'.
By default, unreferenced objects are kept around for a period of 2 weeks. This is to make it easy for you to recover accidentally deleted branches or commits, or to avoid a race where a just-created object in the process of being referenced, but not yet referenced, could be deleted by a 'git gc' process running in parallel. So, to give that grace period to packed but unreferenced objects, the repack process pushes those unreferenced objects out of the pack into their loose form so they can be aged and eventually pruned.
Objects becoming unreferenced are usually not that many, though. Having 404855 unreferenced objects is quite a lot, and being sent those objects in the first place via a clone is stupid and a complete waste of network bandwidth.
Anyway... To solve your problem, you simply need to run 'git gc' with the --prune=now argument to disable that grace period and get rid of those unreferenced objects right away (safe only if no other Git activities are taking place at the same time, which should be easy to ensure on a workstation).
And BTW, using 'git gc --aggressive' with a later Git version (or 'git repack -a -f -d --window=250 --depth=250') should give much better results.
The same thread mentions:
git config pack.deltaCacheSize 1
That limits the delta cache size to one byte (effectively disabling it) instead of the default of 0, which means unlimited. With that I'm able to repack that repository using the above git repack command on an x86-64 system with 4 GB of RAM and using 4 threads (this is a quad core). Resident memory usage grows to nearly 3.3 GB though. If your machine is SMP and you don't have sufficient RAM, then you can reduce the number of threads to only one:
git config pack.threads 1
Additionally, you can further limit memory usage with the --window-memory argument to 'git repack'. For example, using --window-memory=128M should keep a reasonable upper bound on the delta search memory usage, although this can result in less optimal delta matches if the repo contains lots of large files.
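Putting the thread's suggestions together (the values are the ones quoted above; tune them for your machine), a sketch in a throwaway repository:

```shell
# Demo repo; in practice run the config/repack lines in your own repository.
cd "$(mktemp -d)" && git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "demo"
git config pack.threads 1             # single-threaded delta search
git config pack.deltaCacheSize 1      # effectively disable the delta cache
git repack -q -a -d --window=250 --depth=250 --window-memory=128m
ls .git/objects/pack/                 # everything now lives in one pack
```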
On the filter-branch front, you can consider (with caution) this script:
#!/bin/bash
set -o errexit
# Author: David Underhill
# Script to permanently delete files/folders from your git repository. To use
# it, cd to your repository's root and then run the script with a list of paths
# you want to delete, e.g., git-delete-history path1 path2
if [ $# -eq 0 ]; then
exit 0
fi
# make sure we're at the root of git repo
if [ ! -d .git ]; then
echo "Error: must run this script from the root of a git repository"
exit 1
fi
# remove all paths passed as arguments from the history of the repo
files="$*"
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD
# remove the temporary history git-filter-branch otherwise leaves behind for a long time
rm -rf .git/refs/original/ && git reflog expire --all && git gc --aggressive --prune
Upvotes: 34
Reputation: 576
Sometimes, the reason that "gc" doesn't do much good is that there is an unfinished rebase or stash based on an old commit.
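A few quick checks for that situation (a sketch; .git/rebase-apply and .git/rebase-merge are Git's internal state directories, shown here in a clean throwaway repository):

```shell
# In a repository with leftover state, these would show the culprits.
cd "$(mktemp -d)" && git init -q
git stash list                                      # entries here pin commits
ls .git/rebase-apply .git/rebase-merge 2>/dev/null || true  # only mid-rebase
git status                                 # also reports an in-progress rebase
```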
Upvotes: 1
Reputation: 323752
Use git gc --prune=now, or the lower-level git prune --expire now.
Upvotes: 23