Nicholas Finch

Reputation: 327

Stop git history, then add it back together again later

I was brought on to put a project under version control that was not previously tracked.

Not thinking ahead, I added all files to a repository and started tracking it as soon as I started working on it.

Now the repository is enormous, too big to push to GitHub. So I started deleting all of the excess files and using git filter-branch to eradicate them from the history, with this command:

sudo git log --all --pretty=format: --name-only --diff-filter=D | sort -u | while read -r line; do sudo git filter-branch -f --tree-filter "rm -rf { $line }" HEAD; done

The problem? There are so many excess files that this is taking SO LONG the messiah might return before it's done, and I need to get this up to GitHub quickly.

So to speed up the process, I saw that I could just commit the latest files on an orphan branch:

git checkout --orphan <new-branch-name>

So just to get moving, what I'd love to do is push just this one commit to GitHub, keep the cleanup operation running, and then essentially stick the two branches back together again once it's done.

In this manner:

1-----10
         1a------Xa  (1a = 10) 

Becomes

1-----10-1a------Xa

Or possibly

1------10------Xa

So that in the end we keep literally all of the history.

Is this possible? I'm under a time crunch and wouldn't want to lose everything.

Upvotes: 2

Views: 86

Answers (1)

torek

Reputation: 489263

It's not possible as described, because the ID ("true name") of a commit is its hash checksum, which includes all of its history. Hence, in a repo containing these five commits on two branches:

A--B--C--D   <-- with-big-files

         D'  <-- cleaned

you can push either branch, but you can never make D' have, as a predecessor, any other commit. D' is a root commit and will always be a root commit.
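You can see this directly, incidentally: a commit object literally records its parent's hash, and the commit's own ID is a hash over that content, so giving D' a predecessor would necessarily produce a different commit. For instance (hashes shown here as placeholders):

git cat-file -p HEAD
tree <hash of the saved snapshot>
parent <hash of the previous commit>    (this line is simply absent in a root commit such as D')
author ...
committer ...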

What you can do is, e.g., add this cleaned2 branch:

A--B--C--D   <-- with-big-files

         D'     <-- cleaned

           A'-B'-C'   <-- cleaned2

and then merge:

A--B--C--D   <-- with-big-files

         D'---------E   <-- cleaned
                   /
           A'-B'-C'    <-- cleaned2

and then discard the name cleaned2. (If you like, cleaned2 can include D'' which is a copy of D and/or D', but has C' as its parent.)
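In terms of actual commands, a rough sketch of the above might look like this (branch names match the diagrams; the file paths are placeholders for whatever you need to remove, and on Git 2.9 or later the merge needs --allow-unrelated-histories because the two histories share no common root):

# make cleaned2 point at the old history (at C here, matching the diagram),
# then rewrite it with the big files stripped out of every commit
git branch cleaned2 with-big-files~1
git filter-branch -f --index-filter \
    'git rm -r --cached --ignore-unmatch path/to/big-file-1 path/to/big-file-2' \
    -- cleaned2

# merge the rewritten history into the orphan branch; this creates commit E
git checkout cleaned
git merge --allow-unrelated-histories -m "attach cleaned-up history" cleaned2

# the temporary name can now be discarded
git branch -D cleaned2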

Note that no matter whether you use git filter-branch or BFG or even this manual method, what you end up with is a bunch of copies of the original commits, where you have taken the huge files out of the copies.


Edit: this is not an answer to the question, but I thought I should add this side note. You have identified filter-branch as being too slow, but are now solving a different problem, rather than simply speeding up the filter-branch.

First, the filter you are using with git filter-branch (the --tree-filter) is the slowest possible method. It will be much faster (though still not exactly blazing fast) to do each of these removals as an --index-filter.
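In the loop from the question, that substitution would look roughly like this (just a sketch; it assumes none of the paths contain embedded single quotes):

git log --all --pretty=format: --name-only --diff-filter=D | sort -u |
while read -r line; do
    # --index-filter edits only the index; nothing is checked out to a work-tree
    git filter-branch -f --index-filter \
        "git rm -r --cached --ignore-unmatch -- '$line'" HEAD
done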

Second and actually even more important, instead of removing each file with one pass that completely copies every commit in the repository, you should do one pass over every commit in the repository to remove all such files (still using the index filter, to avoid copying every commit out to a work-tree).
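Combining the two ideas, you would gather all the paths up front and make a single rewriting pass over the whole history, along these lines (again only a sketch: the space-joined list below breaks if any of the paths contain whitespace):

# collect every path that was ever deleted, as one space-separated list
paths=$(git log --all --pretty=format: --name-only --diff-filter=D | sort -u | tr '\n' ' ')

# one pass over every commit reachable from HEAD, removing all of those paths at once
git filter-branch -f --index-filter \
    "git rm -r --cached --ignore-unmatch -- $paths" HEAD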

The key that drives all of this is the way git filter-branch works, which I alluded to above. It is impossible to change a commit, in Git, so like all Git commands, filter-branch doesn't. It just seems to, and to make it seem as though some commit(s) is/are changed, Git copies every such commit to a new commit, then hides away the originals and pretends the copies are the originals.

Running git filter-branch HEAD copies every commit reachable from HEAD. I do not know how many commits are in your repository, but let's say there are 150 commits reachable from HEAD and 20 files to remove. You are removing one file per pass, so first you copy 150 commits to remove file A. Then you copy 150 commits (the ones that are minus file A) to remove file B. Then you copy 150 commits (the ones minus both A and B) to remove file C, and so on. This means you are making 150 x 20 = 3000 copies.

Using --index-filter (with git rm --cached --ignore-unmatch) will make the 3000 copies run orders of magnitude faster than using --tree-filter. Removing all the files at once will make just 150 copies. If each improvement on its own reduces the time to 1/20th of the original, doing both together will reduce it to about 1/400th.

Upvotes: 2
