Can git automatically recognise identical commits (with different hashes)?

Question

I have two repository histories that have been duplicated and mangled (via interaction and migration around SVN--not my choice). I have both repositories as remotes in the same temporary maintenance repository. They share a few hundred commits' worth of history, and then the "old" one continues for a few dozen more on a few branches. I need to fast-forward the "new" tree up to the state of the old one. Because of the mangling, however, the shared commits are not recognised as the same history, despite having identical content.

I would like a way to tell git "These two commits are identical, despite having different authors" (author ID was confused in translation). If possible, I would then really like it if git could traverse the two remote trees and make that association for every node with identical content. This would mean I could manually mark "commit 1" on both and have it do the rest. Otherwise I would need to manually mark the root of every divergence (which wouldn't be too bad, but I would prefer not to).

I tried using graft points, which is nearly what I want-- gitk shows what I want, but when I pushed it back to the main (new) repository, it dragged along the couple-hundred duplicate commits. It's also a bit annoying to do, since I have to do it for a not-yet-merged child node.

I found https://stackoverflow.com/a/973403/372757 , and think that it will work: I will merely need to rebase the old commits onto the new repository, once for each branch.

Nonetheless, I would still like to know whether my original request is possible.

twalberg · Accepted Answer

git has a pretty strict definition of what an "identical commit" would be, that probably doesn't match what you're thinking. In order to be an identical commit, all of the following must be true:

every file in the committed tree must be byte-for-byte identical to the same file in the other commit's tree
no new files, no removed files, no reorganization - the tree must match exactly, since the SHA1 of a tree depends on the files and subtrees it contains; if any leaf on the tree is different, the SHA1 of the top-level tree will be different
exactly the same author and committer name and email values
exactly the same author and committer dates
exactly the same parent commit(s), i.e. the value of HEAD at the time the commit was created
exactly the same commit message
possibly a couple other details that I'm missing

All of these things are either directly or indirectly used in generating the SHA1 hash for the new commit, and thus a commit won't be identical unless it's truly identical.
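To make the dependency concrete, here is a minimal Python sketch of how a commit ID is derived, assuming a plain SHA-1 repository and a commit with no extra headers (no signature, no encoding line). The helper names and the identities are made up for illustration; only the object layout follows git's format.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way git does: SHA-1 over '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree, parents, author, committer, message):
    """Build the text of a simple commit object and return its ID.

    author and committer are full identity lines, e.g.
    'Jane Doe <jane@example.com> 1700000000 +0000' (name, email, date).
    """
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = "\n".join(lines) + "\n\n" + message
    return git_object_id("commit", body.encode())


# Hypothetical values, for illustration only.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's well-known empty tree
jane = "Jane Doe <jane@example.com> 1700000000 +0000"
jdoe = "jdoe <jdoe@svn.invalid> 1700000000 +0000"

a = commit_id(EMPTY_TREE, [], jane, jane, "Initial import\n")
b = commit_id(EMPTY_TREE, [], jdoe, jane, "Initial import\n")
c = commit_id(EMPTY_TREE, [], jane, jane, "Initial import\n")

print(a == c)  # True: same inputs give the same ID
print(a == b)  # False: only the author differs, yet the ID changes
```

Because the parent IDs are part of the hashed text, a single changed author near the root changes the ID of every descendant commit as well.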

That said, and I think possibly more to the point of your question, when generating a new commit, if a particular file or tree is byte-for-byte identical to an object that is already in the database, because another commit had those things in exactly the same state, then the new commit will point to those already existing objects - they won't be stored again.
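The reuse of existing objects follows from content addressing. A rough Python sketch of the idea, assuming SHA-1 object IDs; `ObjectStore` is a toy in-memory stand-in, not git's actual storage code:

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Object ID git assigns to file content: SHA-1 of 'blob <size>\\0' + content."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy stand-in for git's object database (hypothetical, in-memory)."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        oid = blob_id(content)
        # Identical content maps to the same key, so it is stored only once.
        self.objects.setdefault(oid, content)
        return oid


store = ObjectStore()
first = store.add(b"hello\n")
second = store.add(b"hello\n")   # same bytes, e.g. from a later commit
other = store.add(b"hello!\n")   # one byte differs

print(first == second)     # True
print(len(store.objects))  # 2
```

The same reasoning applies one level up: a tree object lists the IDs of its entries, so two commits with identical content end up pointing at the same tree ID even when the commits themselves differ.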

If only the author information differs between two branches (which makes them different sequences of commits, even if the file contents match entirely), you can use git filter-branch or git rebase to rewrite a branch, fixing the information as you go. That produces a whole new set of commits, but all the tree and file objects can stay the same, provided you change nothing other than commit messages, times, or author/committer names. Note, however, that if other work (by yourself or others) is already based on the existing branch, making such changes can involve a significant amount of cleanup.
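Since the tree objects stay the same when only metadata differs, one way to find where the two histories line up before rewriting is to compare tree IDs. The following is only a sketch: it assumes each input was produced by something like git log --format='%H %T' <ref> (commit ID, then tree ID, per line), and the function names and the shortened hashes are invented for the example.

```python
from collections import defaultdict


def parse_log(text: str) -> list[tuple[str, str]]:
    """Parse lines of '<commit> <tree>' into (commit, tree) pairs."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            commit, tree = line.split()
            pairs.append((commit, tree))
    return pairs


def match_by_tree(old_log: str, new_log: str) -> dict[str, list[str]]:
    """Map each old commit to the new commits that have the same tree."""
    new_by_tree = defaultdict(list)
    for commit, tree in parse_log(new_log):
        new_by_tree[tree].append(commit)
    matches = {}
    for commit, tree in parse_log(old_log):
        if tree in new_by_tree:
            matches[commit] = list(new_by_tree[tree])
    return matches


# Shortened, made-up hashes standing in for real `git log` output.
old_log = "a1 t1\na2 t2\na3 t3\n"
new_log = "b1 t1\nb2 t2\n"
print(match_by_tree(old_log, new_log))
# {'a1': ['b1'], 'a2': ['b2']}
```

A matching tree only says the snapshots are identical. It does not prove the commits play the same role in history (a revert, for instance, reproduces an earlier tree), which is why each old commit maps to a list of candidates rather than a single answer.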
