Reputation: 7611
I have two repository histories that have been duplicated and mangled (via interaction and migration around SVN, not my choice). I have both repositories as remotes in the same temporary maintenance repository. They share a few hundred commits' worth of history, and then the "old" one continues for a few dozen more on a few branches. I need to fast-forward the "new" tree up to the state of the old one. Because of the mangling, however, they are not recognised as the same tree despite having identical content.
I would like a way to tell git "These two commits are identical, despite having different authors" (author ID was confused in translation). If possible, I would then really like if it could traverse the two remote trees and make that association for every node with identical content. This would mean I could then manually mark "commit 1" on both, and have it do the rest. Otherwise I would need to manually mark the root of every divergence (wouldn't be too bad, but would prefer not to).
I tried using graft points, which are nearly what I want: gitk shows the history I expect, but when I pushed it back to the main (new) repository, it dragged along the couple of hundred duplicate commits. It's also a bit annoying to do, since I have to do it for a not-yet-merged child node.
I found https://stackoverflow.com/a/973403/372757 , and think that it will work: I will merely need to rebase the old commits onto the new repository, once for each branch.
Nonetheless, I would still like to know if my original request is possible.
Upvotes: 3
Views: 3195
Reputation: 15530
Your problem is redefining commit equality. I think you should play with git cat-file and grep to filter out the relevant information of a commit. Maybe the tree line is enough for you (say, git cat-file commit <COMMIT_ID> | grep "^tree"), but I think it would be good to include the parents' trees too (not the parents' commit IDs, because those would differ).
Once you have this equality function, it would be a matter of running git rev-list over your repo and searching the results for duplicates, I think.
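A minimal sketch of that idea, assuming Python 3.7+ and git on the PATH. The function names (read_commits, content_key, find_duplicates) are hypothetical. It uses git log --all with the %H, %T and %P placeholders rather than rev-list plus cat-file, purely for convenience. Two commits are treated as "equal" when their own tree and their parents' trees match:

```python
import subprocess
from collections import defaultdict

def read_commits(repo="."):
    """Return {commit_id: (tree_id, [parent_ids])} for all commits reachable from any ref."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--all", "--format=%H %T %P"],
        check=True, capture_output=True, text=True,
    ).stdout
    commits = {}
    for line in out.splitlines():
        commit_id, tree_id, *parents = line.split()
        commits[commit_id] = (tree_id, parents)
    return commits

def content_key(commit_id, commits):
    """Equality key: the commit's tree plus the trees of its parents (not their IDs)."""
    tree_id, parents = commits[commit_id]
    parent_trees = tuple(commits[p][0] for p in parents if p in commits)
    return (tree_id, parent_trees)

def find_duplicates(commits):
    """Group commit IDs that share a content key; return only groups with 2+ members."""
    groups = defaultdict(list)
    for commit_id in commits:
        groups[content_key(commit_id, commits)].append(commit_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

if __name__ == "__main__":
    # Run inside the maintenance repository that has both remotes fetched.
    for group in find_duplicates(read_commits(".")):
        print(" ".join(group))
```

Each printed line is a set of commits with the same content under this definition. The key is deliberately loose: a revert that restores an earlier state could also match, so the output is a list of candidates to check rather than a definitive mapping.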
Upvotes: 1
Reputation: 62369
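git has a pretty strict definition of what an "identical commit" would be, which probably doesn't match what you're thinking. In order to be identical, two commits must agree on all of the following:
- the tree they point to (the full snapshot of files and directories)
- their parent commit(s)
- the author name, email, and timestamp
- the committer name, email, and timestamp
- the commit message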
All of these things are either directly or indirectly used in generating the SHA1 hash for the new commit, and thus a commit won't be identical unless it's truly identical.
That said, and I think possibly more to the point of your question: when a new commit is generated, if a particular file or tree is byte-for-byte identical to an object that is already in the database (because another commit had it in exactly the same state), then the new commit will point to that existing object. It won't be stored again.
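Both points can be illustrated with a short sketch, assuming a repository that uses the default SHA-1 object format. An object's ID is the SHA-1 of a small header plus the object's bytes, so identical content always maps to the same object, while a commit that differs only in its author line gets a different ID. The helper names and the example identities below are made up for illustration:

```python
import hashlib

def git_object_id(kind, body):
    # Git hashes "<kind> <size in bytes>" + NUL + body to get the object ID.
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree, author, committer, message, parents=()):
    # Text layout of a commit object: headers, a blank line, then the message.
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()

# Identical file content yields the same blob ID, so it is stored only once.
blob_one = git_object_id("blob", b"same content\n")
blob_two = git_object_id("blob", b"same content\n")

# Two commits with the same tree and message but different authors.
EMPTY_TREE = git_object_id("tree", b"")  # ID of the well-known empty tree
when = "1300000000 +0000"
committer = f"Importer <importer@example.com> {when}"
by_alice = git_object_id("commit", commit_body(
    EMPTY_TREE, f"Alice <alice@example.com> {when}", committer, "Initial import"))
by_bob = git_object_id("commit", commit_body(
    EMPTY_TREE, f"Bob <bob@example.com> {when}", committer, "Initial import"))

print(blob_one == blob_two)  # the blobs are shared
print(by_alice == by_bob)    # the commits are not
```

Because a parent's ID is itself part of the commit body, a difference in one early commit changes the ID of every descendant. That is why two histories with identical file content can still share no commits at all.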
If it's only author information that differs between the two branches (which makes them different sequences of commits, even if the file contents match entirely), you can use git filter-branch or git rebase to rewrite a branch, fixing the information as you go. That will result in a whole new set of commits, but all the trees and file objects can stay the same, assuming you don't change anything other than commit messages, times, or author/committer names. Note, however, that if other work (by yourself or others) is already based on the existing branch, there can be a significant amount of cleanup involved in making such changes.
Upvotes: 4