Reputation: 3931
I've seen several articles and questions about how to remove a single file from all git history. Example: How to remove/delete a large file from commit history in Git repository?
What I'd like to do is remove all files that are not currently present at the head of the master branch.
My use case is that I'm splitting off a smaller repository (call it small) from a monolithic repository (call it monolith). I want to preserve the git history when creating small, but only the relevant git history.
First, I created a new repository small on GitHub. Then, on my laptop, I added it as a remote named origin-small to my local monolith repository, and pushed the current state of the master branch of monolith to origin-small.
I then removed the remote origin-small from monolith, changed directories, and cloned small from GitHub. Voilà, I had a copy of my original repository, monolith, with its full history.
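The setup described above can be sketched roughly as follows. This is only an illustration: a local bare repository stands in for the empty GitHub repo, the throwaway monolith contains a single hypothetical commit, and the temporary paths and the demo identity are assumptions made so the sketch runs on its own.

```shell
# Local stand-ins (assumptions): "$work/small.git" plays the role of the empty GitHub repo.
work=$(mktemp -d)
git init --quiet --bare "$work/small.git"

# A throwaway "monolith" with one commit on master.
git init --quiet "$work/monolith"
cd "$work/monolith"
git symbolic-ref HEAD refs/heads/master
echo hello > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit --quiet -m "initial commit"

# The steps from the question: add the remote, push master, remove the remote.
git remote add origin-small "$work/small.git"
git push --quiet origin-small master
git remote remove origin-small

# Change directories and clone the new repository.
cd "$work"
git clone --quiet --branch master "$work/small.git" small

# The clone carries the full history of master.
git -C "$work/small" log --oneline
```

With a real GitHub repository, the two local paths would be replaced by the repository URL.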
But there are loads of files in the history of small that are no longer relevant, and they are bloating the repo.
What I'd like to do is remove from the history of small every file that is not currently present at the head of master. Is there a way to do this with a single command? Or do I need to run git filter-branch once for every file/directory that I want to remove?
Upvotes: 18
Views: 6335
Reputation: 30858
List all the files that exist in the old commits.
git rev-list HEAD | sed 1d | xargs -I {} git ls-tree -r {} --name-only | sort -u
List all the files that exist in the head.
git ls-tree -r HEAD --name-only | sort -u
Get the files that don't exist in the head (reference).
files=$(comm -23 <(git rev-list HEAD | sed 1d | xargs -I {} git ls-tree -r {} --name-only | sort -u) <(git ls-tree -r HEAD --name-only | sort -u))
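To see what comm -23 contributes here, a small sketch with two made-up file lists may help. The paths are hypothetical and stand in for the two git ls-tree outputs. Temporary files are used instead of process substitution so the sketch also runs in a plain POSIX shell, and LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
export LC_ALL=C
workdir=$(mktemp -d)

# Hypothetical stand-ins for "every path ever committed" and "paths at HEAD".
printf '%s\n' README.md old/legacy.c src/main.c tools/gen.py | sort -u > "$workdir/history.txt"
printf '%s\n' README.md src/main.c | sort -u > "$workdir/head.txt"

# -2 drops lines unique to head.txt, -3 drops lines common to both,
# so only paths that exist in history but not at HEAD remain.
lost=$(comm -23 "$workdir/history.txt" "$workdir/head.txt")
printf '%s\n' "$lost"

rm -rf "$workdir"
```

Both inputs must be sorted the same way, which is why each side of the original command ends in sort -u.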
Replace the newlines that separate the file names with spaces; otherwise they cause an error in git filter-branch.
lostfiles=$(echo $files | sed -e 's/\s/ /g')
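Since \s is a GNU sed extension, a possibly more portable way to do the same join is tr. The sketch below uses two hypothetical paths in place of the real list. Note that either variant breaks on file names that themselves contain spaces, because the result is later split on spaces again.

```shell
# Hypothetical list of lost files, one per line, as produced by the comm step.
files=$(printf '%s\n' old/legacy.c tools/gen.py)

# Turn the newline separators into spaces so the list fits on one command line.
lostfiles=$(printf '%s' "$files" | tr '\n' ' ')
printf '%s\n' "$lostfiles"
```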
Remove lostfiles from the history:
git filter-branch -f --tree-filter "rm -rf ${lostfiles}" --prune-empty
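A commonly used alternative, not part of the answer above, is --index-filter, which rewrites the index directly instead of checking out every commit and is therefore usually faster than --tree-filter. The sketch below builds a throwaway repository to show the idea; the repository, file names, and demo identity are assumptions for illustration only.

```shell
# Silence the filter-branch deprecation notice and its startup delay.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository (hypothetical): gone.txt is added, then deleted.
repo=$(mktemp -d)
cd "$repo"
git init --quiet .
git config user.name demo
git config user.email demo@example.com
echo keep > keep.txt
echo gone > gone.txt
git add keep.txt gone.txt
git commit --quiet -m "add both files"
git rm --quiet gone.txt
git commit --quiet -m "delete gone.txt"

lostfiles="gone.txt"

# Remove the lost paths from the index of every commit;
# --ignore-unmatch keeps commits that never contained them from failing.
git filter-branch -f --prune-empty --index-filter \
  "git rm -r --cached --quiet --ignore-unmatch -- ${lostfiles}"

# List the paths that remain in the rewritten history.
git log --format= --name-only HEAD
```

In this sketch the commit that only deleted gone.txt becomes empty after the rewrite, so --prune-empty drops it.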
It's possible to combine these into a single command, but I don't know whether that would cause any performance issues, so I prefer to keep them as separate commands.
Upvotes: 4
Reputation: 3931
I ended up using git-filter-repo. WARNING: This approach is NOT able to update tags on the remote, if there are any.
Install git-filter-repo.
brew install git-filter-repo
Clone your desired repo, in mirror form.
git clone --mirror <my-repo-url>
Enter the repo directory.
cd <my-repo-name>
Analyze the repo to identify all files that are in the history, but no longer exist.
git filter-repo --analyze
In the analysis output directory, there will be a file named path-deleted-sizes.txt that contains a list of all files that were committed at some point and later deleted, but still exist in the git history.
Create a new file that lacks the headers and other columns.
tail -n +3 ./filter-repo/analysis/path-deleted-sizes.txt \
| tr -s ' ' \
| cut -d ' ' -f 5- \
> ./filter-repo/analysis/path-deleted.txt
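To illustrate what this pipeline does, here is a sketch run against a made-up report. The two header lines and the column layout (unpacked size, packed size, date deleted, path) are assumptions about what the analysis file looks like, and the rows are invented.

```shell
analysis=$(mktemp -d)

# Hypothetical sample of path-deleted-sizes.txt: two header lines, then data rows.
cat > "$analysis/path-deleted-sizes.txt" <<'EOF'
=== Deleted paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name(s)
    57151421   44898377 2019-02-26 assets/video.mp4
       20480       5120 2020-07-14 docs/old notes.txt
EOF

# Skip the two header lines, squeeze runs of spaces, keep field 5 onward (the path).
# Field 1 is empty because each row starts with a space.
tail -n +3 "$analysis/path-deleted-sizes.txt" \
  | tr -s ' ' \
  | cut -d ' ' -f 5- \
  > "$analysis/path-deleted.txt"

cat "$analysis/path-deleted.txt"
```

Using -f 5- rather than -f 5 keeps paths that contain a single space intact. The field count relies on every row starting with at least one space, as in this sample; a row without leading padding would be cut one field too far.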
Clean the git history of all files that no longer exist. This will also clean dirty commits, remove empty commits, and recompress everything for you.
git filter-repo --invert-paths --paths-from-file ./filter-repo/analysis/path-deleted.txt
Clean up the ./filter-repo directory, or you won't be able to push your changes.
rm -rf ./filter-repo
Force-push all refs to the origin. It will force-push, even though the command doesn't indicate it. Also, it will update all branches on the remote, which is convenient. If you have branch protection enabled on some branches in GitHub/Bitbucket/etc., then you will need to allow force-pushes. You can always re-run this command if you find that some refs could not be force-pushed.
git push
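The reason a plain git push force-updates every ref is most likely the mirror setting: git clone --mirror records remote.origin.mirror = true, and pushing to such a remote behaves as if --mirror had been given. A quick way to check the setting, sketched here with an empty local bare repository as a hypothetical stand-in for the real origin:

```shell
work=$(mktemp -d)

# Stand-in for the real remote (assumption): an empty bare repository.
git init --quiet --bare "$work/origin.git"

# Mirror-clone it, as in the steps above.
git clone --quiet --mirror "$work/origin.git" "$work/mirror.git"

# A mirror clone marks its origin remote as a mirror.
mirror_flag=$(git -C "$work/mirror.git" config --get remote.origin.mirror)
echo "$mirror_flag"
```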
Upvotes: 26