Reputation: 1164
I'm trying to better understand the magic behind git-rebase. I was very pleasantly surprised today by the following behavior, which I didn't expect.
TLDR: I rebased a shared branch, causing all commit sha1s to change. Despite this, a derived branch was able to accurately identify that its original commits were "aliased" into new commits with different sha1s. The rebase didn't create any mess at all.
Take a master branch: M1
Branch it off into branch-X, with some additional commits added: M1-A1-B1-C1. Note down the git-log output.
Branch off branch-X into branch-Y, with one additional commit added: M1-A1-B1-C1-D1. Note down the git-log output.
Add a new commit to the tip of the master branch: M1-M2
Rebase branch-X onto the updated master: M1-M2-A2-B2-C2. Note that A2-B2-C2 all have the same message, contents and author-date as A1-B1-C1. However, they have completely different sha1 values, as well as commit dates. According to this writeup, the SHA1 is different because the commit's parent has changed.
Rebase branch-Y onto the updated branch-X. Result: M1-M2-A2-B2-C2-D2.
Notably only the D1 commit is applied (and becomes D2). The A1-B1-C1 commits in branch-Y are completely ignored by git-rebase. You can see this in the output logs.
This is wonderful, but how does git-rebase know to ignore A1-B1-C1? How does git-rebase know that A2-B2-C2 are the same as A1-B1-C1, and hence, can be safely ignored? I had always assumed that git keeps track of commits using the sha1 identifier, but despite the above commits having different sha1s, git still somehow knows that they are linked together. How does it do that? Given the above behavior, when is it truly dangerous to rebase a shared branch?
Upvotes: 7
Views: 1031
Reputation: 16577
Internally, git rebase lists commits that should be rebased, and then computes a patch-id for these commits. Unlike the commit id, it only hashes the content of the patch, not the content of the tree and commit objects. So, A1 and A2, while having different identifiers, have the same patch-id. Then, git rebase skips patches whose patch-id is already present.
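To make the idea concrete, here is a minimal Python sketch of a patch-id-like hash. It is a simplified model, not Git's actual algorithm: the function name toy_patch_id and the sample diffs are made up for illustration, and the only assumption is that position-dependent lines (the index and @@ lines) and whitespace are left out of the hash.

```python
import hashlib

def toy_patch_id(diff_text: str) -> str:
    """Hash only the change itself (simplified model, not Git's real algorithm)."""
    h = hashlib.sha1()
    for line in diff_text.splitlines():
        # Skip lines that depend on where the patch applies or on blob ids.
        if line.startswith(("index ", "@@")):
            continue
        # Remove all whitespace so spacing differences do not matter.
        h.update("".join(line.split()).encode())
    return h.hexdigest()

# Hypothetical diff for A1 (applied on top of M1).
diff_a1 = """diff --git a/f.txt b/f.txt
index 1111111..2222222 100644
--- a/f.txt
+++ b/f.txt
@@ -1,2 +1,3 @@
 hello
+world
 bye
"""

# Same change after the rebase (A2): blob ids and line offsets differ.
diff_a2 = """diff --git a/f.txt b/f.txt
index 3333333..4444444 100644
--- a/f.txt
+++ b/f.txt
@@ -5,2 +5,3 @@
 hello
+world
 bye
"""

# A genuinely different change.
diff_other = diff_a1.replace("+world", "+moon")

print(toy_patch_id(diff_a1) == toy_patch_id(diff_a2))
print(toy_patch_id(diff_a1) == toy_patch_id(diff_other))
```

In this toy model the first comparison is true and the second is false, which mirrors why A1 and A2 are treated as the same patch even though their commit ids differ.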
For more information, search for patch-id here: https://git-scm.com/book/en/v2/Git-Branching-Rebasing
Relevant section from above (diagrams missing):
If someone on your team force pushes changes that overwrite work that you’ve based work on, your challenge is to figure out what is yours and what they’ve rewritten.
It turns out that in addition to the commit SHA-1 checksum, Git also calculates a checksum that is based just on the patch introduced with the commit. This is called a “patch-id”.
If you pull down work that was rewritten and rebase it on top of the new commits from your partner, Git can often successfully figure out what is uniquely yours and apply them back on top of the new branch.
For instance, in the previous scenario, if instead of doing a merge when we’re at "Someone pushes rebased commits, abandoning commits you’ve based your work on", we run git rebase teamone/master, Git will:
- Determine what work is unique to our branch (C2, C3, C4, C6, C7)
- Determine which are not merge commits (C2, C3, C4)
- Determine which have not been rewritten into the target branch (just C2 and C3, since C4 is the same patch as C4')
- Apply those commits to the top of teamone/master
This only works if C4 and C4' that your partner made are almost exactly the same patch. Otherwise the rebase won’t be able to tell that it’s a duplicate and will add another C4-like patch (which will probably fail to apply cleanly, since the changes would already be at least somewhat there).
Upvotes: 13
Reputation: 489283
There are in fact several different methods git rebase uses to eliminate redundant copies.
The first, and safest, one is via the same method that git cherry uses to identify cherry-picked commits. If you read the linked documentation, though, the only clue as to how this works is at the end, where the manual page links to the git patch-id documentation.
Reading this second manual page will give you a good idea of how "commit equivalence" gets established: Git simply computes a git patch-id on the output from, e.g., git show of any ordinary (non-merge) commit. Really, it runs git diff-tree rather than the user-oriented git show, but the effect is about the same.
But there's still something missing, and it's very poorly documented in both git rebase and git cherry. It's documented somewhat better in git rev-list, which is a rather daunting manual page. There are two keys: the notion of symmetric difference, using the three-dot syntax described in the gitrevisions documentation, and the --left-right and --cherry-mark options to git rev-list.
Once you understand how we take a DAGlet such as:
...--o--o--L1--L2--L3   <-- left
         \
          R1--R2--R3   <-- right
and use left...right to select the three L and R commits, the --left-right option itself makes lots of sense: it marks which commits in the text output are from the left side of the three dots, and which are right-side commits.
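As a rough illustration of what the three-dot selection does, here is a small Python sketch over a toy graph. The commit names o1 and o2 stand in for the two unnamed o commits, and the helpers ancestors and left_right are hypothetical names for this sketch, not anything Git exposes.

```python
def ancestors(parents, tip):
    """Return every commit reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen

def left_right(parents, left, right):
    """Model left...right: commits reachable from exactly one of the two tips."""
    from_left = ancestors(parents, left)
    from_right = ancestors(parents, right)
    return sorted(from_left - from_right), sorted(from_right - from_left)

# Toy version of the DAGlet: both branches fork from o2.
parents = {
    "o1": [], "o2": ["o1"],
    "L1": ["o2"], "L2": ["L1"], "L3": ["L2"],
    "R1": ["o2"], "R2": ["R1"], "R3": ["R2"],
}

left_only, right_only = left_right(parents, "L3", "R3")
print("left side: ", left_only)
print("right side:", right_only)
```

The shared o commits drop out because they are reachable from both tips, which leaves the three L commits on one side and the three R commits on the other.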
The second step here is discovering that git rev-list can compute the patch ID for each commit on each "side". Git can then compare all the left-side patch IDs with all the right-side patch IDs. The --cherry-mark option, and its related options, use these to mark equivalent or inequivalent commits, or to omit equivalent commits.
The final piece to this particular puzzle is that git rebase does not, as the documentation claims, use <upstream>..HEAD. Instead, it uses the equivalent of git rev-list --cherry-pick --right-only --no-merges <upstream>...HEAD to get the set of commits to copy. (To these options we must also add --topo-order and --reverse.)
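Applied to the scenario in the question, that selection can be sketched in Python as follows. This is a toy model under stated assumptions: each commit carries a made-up patch id string ("a", "b", ...), the helper names ancestors and commits_to_copy are hypothetical, and the ordering options (--topo-order, --reverse) are not modeled.

```python
def ancestors(parents, tip):
    """Return every commit reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen

def commits_to_copy(commits, upstream, head):
    """Toy model of: rev-list --cherry-pick --right-only --no-merges upstream...head.

    commits maps a commit name to (list of parents, patch id).
    """
    parents = {name: ps for name, (ps, _) in commits.items()}
    from_upstream = ancestors(parents, upstream)
    from_head = ancestors(parents, head)
    left = from_upstream - from_head    # only reachable from upstream
    right = from_head - from_upstream   # only reachable from head
    left_patch_ids = {commits[name][1] for name in left}
    picked = []
    for name in right:                  # --right-only
        ps, patch_id = commits[name]
        if len(ps) > 1:                 # --no-merges
            continue
        if patch_id in left_patch_ids:  # --cherry-pick: drop equivalent patches
            continue
        picked.append(name)
    return sorted(picked)

# The question's history after branch-X was rebased (patch ids are made up).
commits = {
    "M1": ([], "m1"), "M2": (["M1"], "m2"),
    "A1": (["M1"], "a"), "B1": (["A1"], "b"),
    "C1": (["B1"], "c"), "D1": (["C1"], "d"),
    "A2": (["M2"], "a"), "B2": (["A2"], "b"), "C2": (["B2"], "c"),
}

# upstream = tip of rebased branch-X (C2), HEAD = tip of branch-Y (D1).
print(commits_to_copy(commits, "C2", "D1"))
```

In this model A1, B1 and C1 are on the right side of the symmetric difference, but their patch ids match A2, B2 and C2 on the left side, so only D1 remains to be copied.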
The second method git rebase uses to elide commits is the --fork-point mechanism now built into git merge-base. This mechanism is particularly tricky to describe and, furthermore, relies on reflog entries to know about commits that were on a branch in the past but no longer are. It also sometimes gives an undesirable result, and is not useful in this particular kind of rebase.
I mainly mention it here because someone looking for reasons that git rebase left out some commit(s) might have come across a case where the fork-point mechanism has misfired. See, e.g.:
Upvotes: 5
Reputation: 970
The branch-Y commits are empty upon the second rebase
There is really no magic hidden inside. Rebase finds the common history and ignores it (only commit M1 in this case). It then detaches the remaining history of the rebased branch (branch-Y) and tries to pick it onto the new base (branch-X).
The picking step derives a patch from each picked commit and its predecessor. Because the result is empty for A1, B1 and C1, rebase simply skips these commits. Only D1 is then picked, so D2 is created (with a new SHA, because the parent link in the commit header changes, as correctly stated in the question).
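The point about the parent link can be checked with a short Python sketch that hashes a commit object the way Git does for SHA-1 repositories: the header "commit <size>" plus a NUL byte, followed by the commit text. All field values below (tree id, parent ids, author line) are placeholders invented for the example, and commit_id is a hypothetical helper name.

```python
import hashlib

def commit_id(tree, parent, author, committer, message):
    """Compute the SHA-1 of a single-parent commit object."""
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {author}\n"
        f"committer {committer}\n"
        f"\n{message}\n"
    ).encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# Placeholder values, not taken from any real repository.
tree = "a" * 40
who = "Jane Doe <jane@example.com> 1500000000 +0000"
old_parent = "1" * 40   # stands in for the old base
new_parent = "2" * 40   # stands in for the new base

d1 = commit_id(tree, old_parent, who, who, "Add D")
d2 = commit_id(tree, new_parent, who, who, "Add D")

# Same tree, author and message, but a different parent gives a different id.
print(d1 != d2)
```

In a real rebase the tree and committer date usually change as well, but the sketch shows that the parent line alone is enough to produce a new id.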
Upvotes: 1