Cameron Hudson

Reputation: 3931

git: How to remove *all* files from the git history that are not currently present?

I've seen several articles and questions about how to remove a single file from all git history. Example: How to remove/delete a large file from commit history in Git repository?

What I'd like to do is remove all files that are not currently present at the head of the master branch.

My use case is that I'm splitting off a smaller repository (call it small) from a monolithic repository (call it monolith). I want to preserve the git history when creating small, but only the relevant git history.

First, I created a new repository small on GitHub. Then, on my laptop, I added it as a remote named origin-small to my local monolith repository, and pushed the current state of the master branch of monolith to origin-small.

I then removed the remote origin-small from monolith, changed directories, and cloned small from GitHub. Voilà, I had a copy of my original repository, monolith, with its full history.

But, there are loads of files in the history of small that are no longer relevant, and they are bloating the repo.

What I'd like to do is:

  1. Delete all of the unnecessary files from small.
  2. Run a command to clear the whole git history of the files that I just deleted.

Is there a way to do this with a single command? Or do I need to run git filter-branch once for every file/directory that I want to remove?

Upvotes: 18

Views: 6335

Answers (2)

ElpieKay

Reputation: 30858

List all the files that exist in the old commits.

git rev-list HEAD | sed 1d | xargs -I{} git ls-tree -r --name-only {} | sort -u

List all the files that exist in the head.

git ls-tree -r HEAD --name-only | sort -u

Get the files that don't exist in the head (reference).

files=$(comm -23 <(git rev-list HEAD | sed 1d | xargs -I{} git ls-tree -r --name-only {} | sort -u) <(git ls-tree -r --name-only HEAD | sort -u))

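The paths in files are separated by newlines. Join them with spaces so that they can be passed to git filter-branch as one argument list; otherwise the filter fails. Note that this breaks for paths that themselves contain spaces.

lostfiles=$(printf '%s' "$files" | tr '\n' ' ')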

Remove lostfiles from the history:

git filter-branch -f --tree-filter "rm -rf ${lostfiles}" --prune-empty

It's possible to combine these into a single command, but I don't know whether that would hurt performance, so I prefer to keep them separate.
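For reference, the listing steps above can be wrapped in one small function. This is only a sketch: lost_files is a hypothetical helper name, it uses temporary files instead of process substitution so it also runs in a plain POSIX shell, and it assumes paths without newlines or other characters that git would quote in its output.

```shell
# Sketch: print every path that appears in some commit reachable from HEAD
# but is absent from the tree at HEAD, one path per line.
# "lost_files" is a hypothetical helper name, not a git command.
lost_files() {
    all_paths=$(mktemp)
    head_paths=$(mktemp)

    # Every path in every commit reachable from HEAD (HEAD included;
    # its paths are subtracted again by comm below).
    git rev-list HEAD | while read -r commit; do
        git ls-tree -r --name-only "$commit"
    done | sort -u > "$all_paths"

    # Paths that currently exist at HEAD.
    git ls-tree -r --name-only HEAD | sort -u > "$head_paths"

    # Lines that occur only in the first (history) list.
    comm -23 "$all_paths" "$head_paths"

    rm -f "$all_paths" "$head_paths"
}
```

Run inside the repository, `lost_files > lost.txt` would give a one-path-per-line file. That file could be passed to rm in a tree filter, or, as far as I can tell, to git filter-repo --invert-paths --paths-from-file as in the other answer.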

Upvotes: 4

Cameron Hudson

Reputation: 3931

I ended up using git-filter-repo. WARNING: This approach is NOT able to update tags on the remote, if there are any.

  1. Install git-filter-repo.

    brew install git-filter-repo
    
  2. Clone your desired repo, in mirror form.

    git clone --mirror <my-repo-url>
    
  3. Enter the repo directory.

    cd <my-repo-name>
    
  4. Analyze the repo to identify all files that are in the history, but no longer exist.

    git filter-repo --analyze
    
  5. In the analysis output directory, there will be a file named path-deleted-sizes.txt that contains a list of all files that were committed at some point and later deleted, but that still exist in the git history.

    Create a new file that lacks the headers and other columns.

    tail -n +3 ./filter-repo/analysis/path-deleted-sizes.txt \
        | tr -s ' ' \
        | cut -d ' ' -f 5- \
        > ./filter-repo/analysis/path-deleted.txt
    
  6. Clean the git history of all files that no longer exist. This will also clean dirty commits, remove empty commits, and recompress everything for you.

    git filter-repo --invert-paths --paths-from-file ./filter-repo/analysis/path-deleted.txt
    
  7. Clean up the ./filter-repo directory, or you won't be able to push your changes.

    rm -rf ./filter-repo
    
  8. Force-push all refs to the origin. It will force-push, even though the command doesn't indicate it. Also, it will update all branches on the remote, which is convenient. If you have branch protection enabled on some branches in GitHub/Bitbucket/etc., then you will need to allow force-pushes. You can always re-run this command if you find that some refs could not be force-pushed.

    git push
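One caveat about step 5: tr -s squeezes runs of spaces, so a path that contains two consecutive spaces would be altered. Below is a sketch of an alternative that strips only the leading columns. It assumes the layout I saw in path-deleted-sizes.txt (two header lines, then unpacked size, packed size, date and path separated by spaces), and extract_deleted_paths is a hypothetical helper name.

```shell
# Sketch: print only the path column of a path-deleted-sizes.txt report.
# Assumed layout (not guaranteed across git-filter-repo versions):
#   two header lines, then "<unpacked size> <packed size> <date> <path>"
# "extract_deleted_paths" is a hypothetical helper name.
extract_deleted_paths() {
    # Skip the two header lines, then remove the three leading columns
    # without touching any spaces inside the path itself.
    tail -n +3 "$1" | sed -E 's/^ *[0-9]+ +[0-9]+ +[^ ]+ //'
}
```

It would be used the same way as the pipeline in step 5, for example `extract_deleted_paths ./filter-repo/analysis/path-deleted-sizes.txt > ./filter-repo/analysis/path-deleted.txt`.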
    

Upvotes: 26
