sfletche

Reputation: 49714

Why would git cherry-pick produce fewer conflicts than git rebase?

I rebase often. Occasionally a rebase is particularly problematic (lots of merge conflicts), and my solution in such cases is to cherry-pick the individual commits onto the master branch. I do this because nearly every time I do, the number of conflicts is considerably smaller.

My question is why this would be the case.

Why are there fewer merge conflicts when I cherry-pick than when I rebase?

In my mental model a rebase and a cherry-pick are doing the same thing.

Rebase example

A-B-C (master)
   \
    D-E (next)

git checkout next
git rebase master

produces

A-B-C (master)
     \
      D`-E` (next)

and then

git checkout master
git merge next

produces

A-B-C-D`-E` (master)

Cherry pick example

A-B-C (master)
   \
    D-E (next)

git checkout master 
git cherry-pick D E

produces

A-B-C-D`-E` (master)

From my understanding the end result is the same. (D and E are now on master with a clean (straight-line) commit history.)

Why would the latter (cherry picking) ever produce fewer merge conflicts than the former (rebasing)?

UPDATE UPDATE UPDATE

I was finally able to reproduce this problem and I realize now that I may have oversimplified the example above. Here's how I was able to reproduce...

Say I have the following (notice the extra branch)

A-B-C (master)
   \
    D-E (next)
       \
        F-G (other-next)

And then I do the following

git checkout next
git rebase master
git checkout master
git merge next

I end up with the following

A-B-C-D`-E` (master, next)
   \
    D-E
       \
        F-G (other-next)

From here, I'll either rebase or cherry-pick

Rebasing example

git checkout other-next
git rebase master 

produces

A-B-C-D`-E`-F`-G` (master)

Cherry picking example

git checkout master
git cherry-pick F G

produces the same result

A-B-C-D`-E`-F`-G` (master)

but with far fewer merge conflicts than the rebasing strategy.

Having finally reproduced a similar example I think I see why there were more merge conflicts with the rebasing than with the cherry picking, but I'll leave it for someone else (who will likely do a better (and more accurate) job than I would) to answer.

Upvotes: 13

Views: 2463

Answers (1)

torek

Reputation: 487883

Updated answer (see update in question)

I think what's happening here has to do with choosing the commits to copy.

Let's note, and then put aside, the fact that git rebase may use either git cherry-pick, or git format-patch and git am, to copy some commits. In most cases git cherry-pick and git am should achieve the same results. (The git rebase documentation specifically calls out upstream file renames as a case where the cherry-pick method behaves differently from the git am-based method, which was the default for non-interactive rebase before Git 2.26. See also various parenthetical remarks in original answer below, and comments.)

The main thing to consider here is which commits are to be copied. In the manual method, you first manually copy commits D and E to D' and E', then you manually copy F and G to F' and G'. This is the minimal amount of work to do and is just what we want; the only drawback here is all the manual commit-identifying we have to do.

When you use the command:

git checkout <branch> && git rebase <upstream>

you make Git automate the process of finding commits to copy. This is great when Git gets it right, but not if Git gets it wrong.

So how does Git choose these commits? The simple, but somewhat wrong, answer is in this sentence (from the same documentation):

All changes made by commits in the current branch but that are not in <upstream> are saved to a temporary area. This is the same set of commits that would be shown by git log <upstream>..HEAD; or by git log 'fork_point'..HEAD, if --fork-point is active (see the description on --fork-point below); or by git log HEAD, if the --root option is specified.

The --fork-point complication is a relatively recent addition to Git, but it's not "active" in this case because you specified an <upstream> argument and did not specify --fork-point. The actual <upstream> is master both times.

Now, if you actually run each git log (with --oneline to make it nicer):

git checkout next && git log --oneline master..HEAD

and:

git checkout other-next && git log --oneline master..HEAD

you will see that the first one lists commits D and E (excellent!) but the second one lists D, E, F, and G. Uh oh, D and E show up in both lists, so they are candidates for copying a second time!
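The two listings can be reproduced in a scratch repository. The following is only a sketch: the one-file-per-commit layout and the commit helper are assumptions made for illustration, not part of the question's real history.

```shell
# Sketch: rebuild the question's history in a throwaway repository.
# The commit() helper and the *.txt files are hypothetical.
set -e
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master   # name the unborn branch "master"
git config user.name "Demo"
git config user.email "demo@example.com"

commit() {                                # commit <label>: one new file per commit
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit A; commit B
git checkout -q -b next;       commit D; commit E
git checkout -q -b other-next; commit F; commit G
git checkout -q master;        commit C

# Before any rebase: the candidates for "next" are D and E.
next_candidates=$(git rev-list --count master..next)

# First rebase, then fast-forward master to the copies D' and E'.
git checkout -q next
git rebase -q master
git checkout -q master
git merge -q --ff-only next

# The candidates for "other-next" still include the ORIGINAL D and E.
other_candidates=$(git rev-list --count master..other-next)

echo "next: $next_candidates candidates, other-next: $other_candidates candidates"
git log --format=%s master..other-next    # subjects of the candidate commits
```

Because every commit here touches its own file, nothing conflicts; the point is only which commits the master..HEAD range selects.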

The thing is, this sometimes works. Well, I said "somewhat wrong" above. Here's what makes it wrong, just two paragraphs down from the earlier quote:

Note that any commits in HEAD which introduce the same textual changes as a commit in HEAD..<upstream> are omitted (i.e., a patch already accepted upstream with a different commit message or timestamp will be skipped).

Note that HEAD..<upstream> here is the reverse of the <upstream>..HEAD in the git log commands we just ran, where we saw D-through-G.

For the first rebase, there are no commits in git log HEAD..master, so there are no commits that could possibly get skipped. That's good, because there are no commits to skip: we're copying D and E to D' and E', and that's just what we want.

For the second rebase, though, which happens after the first rebase is done, git log HEAD..master will show you commits D' and E': the two copies we just made. Their originals, D and E, are potentially skipped: they are candidates to consider skipping.

"Potentially skipped" is not "really skipped"

So how does Git decide which commits it should really skip? The answer is in git patch-id, although it's actually implemented directly in git rev-list, which is a very fancy and complicated command. Neither of these really describes it terribly well, though, in part because it is hard to describe. Here's my attempt anyway. :-)

What Git does here is look at the diffs, after stripping off identifying line numbers, in case the patches go in slightly different locations (due to earlier patches moving lines up and down in files). It uses the same tricks it uses with files—turning file contents into unique hashes—to turn each commit into a "patch ID". The commit ID is a unique hash that identifies one specific commit, and always that same one specific commit. The patch ID is a different (but still unique-to-some-content) hash ID that always identifies "the same" patch, i.e., something that removes and adds the same diff-hunks, even if it removes and adds them from different locations.

Having computed a patch ID for every commit, Git can then say: "Aha, commit D and commit D' have the same patch-ID! I should skip copying D because D' is probably a result of copying D." It can do the same for E vs E'. This often works—but it fails for D whenever the copy from D to D' required manual intervention (fixing merge conflicts), and it likewise fails for E whenever the copy from E to E' required manual intervention.
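To see this for yourself, you can compare patch IDs by hand. This is a minimal sketch in a scratch repository; the patch_id helper and the file names are made up for the example, and the "conflict resolution" is simulated by simply amending the copy.

```shell
# Sketch: a clean copy keeps its patch ID; a hand-edited copy does not.
# patch_id() and the file names are hypothetical.
set -e
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"

patch_id() {    # print only the patch ID of one commit
    git show "$1" | git patch-id | cut -d' ' -f1
}

echo base > base.txt; git add base.txt; git commit -q -m A
git checkout -q -b next
echo d > d.txt; git add d.txt; git commit -q -m D
git checkout -q master
echo c > c.txt; git add c.txt; git commit -q -m C

git cherry-pick next >/dev/null     # clean copy of D onto master: D'
id_d=$(patch_id next)
id_copy=$(patch_id master)

# Simulate manual intervention by changing what the copy introduces.
echo "d, edited during conflict resolution" > d.txt
git commit -q -a --amend -m D
id_edited=$(patch_id master)

echo "D:           $id_d"
echo "D' (clean):  $id_copy"
echo "D' (edited): $id_edited"
```

The first two IDs should match even though the commit IDs differ, while the edited copy gets a different patch ID, which is the situation in which the skip fails.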

A smarter rebase

What's needed here is a sort of "smart rebase" that can look at a series of branches and compute, in advance, which commits to copy, once, for all the to-be-rebased branches. Then, after all the copies are done, this "smart rebase" would adjust all the branch-names.

In this particular case—copying D through G—it's actually pretty easy, and you can do this manually with:

$ git checkout -q other-next && git rebase master
[here rebase copies D, E, F, and G, perhaps with your assistance]

followed by:

$ git checkout next
[here git checks out "next", so that HEAD is ref: refs/heads/next
 and refs/heads/next points to original commit E]
$ git reset --hard other-next~2

This works because other-next names commit G', whose parent is F', whose parent in turn is E', and this is where we want next to point. Since HEAD refers to branch next, git reset adjusts refs/heads/next to point to commit E', and we're done.
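The whole sequence can be tried end to end in a scratch repository. Again, this is only a sketch under assumptions: the commit helper and one-file-per-commit layout are invented for the demonstration, so no conflicts arise here.

```shell
# Sketch: rebase the longest branch once, then move "next" onto the copies.
# The commit() helper and the *.txt files are hypothetical.
set -e
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"

commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit A; commit B
git checkout -q -b next;       commit D; commit E
git checkout -q -b other-next; commit F; commit G
git checkout -q master;        commit C

old_next=$(git rev-parse next)      # original commit E

git checkout -q other-next
git rebase -q master                # copies D, E, F, and G exactly once

git checkout -q next
git reset -q --hard other-next~2    # point "next" at E'

next_subject=$(git log -1 --format=%s next)
next_ahead=$(git rev-list --count master..next)         # D' and E'
other_ahead=$(git rev-list --count next..other-next)    # F' and G'
echo "next -> $next_subject, $next_ahead ahead of master; other-next $other_ahead ahead of next"
```

After this, both branches sit on top of master and share the same copies of D and E, so no commit was copied twice.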

In more complex cases, the commits that need to be copied-exactly-once are not all neatly linear:

                A1-A2-A3  <-- featureA
               /
...--o--o--o--o--o--o--o   <-- master
         \
          *--*--B3-B4-B5   <-- featureB
              \
               C3-C4       <-- featureC

If we want to "multi-rebase" all three features, we can rebase featureA independently of the other two (none of the three A commits depend on anything "non-master" other than earlier A commits). To copy the five B commits and the four C commits, though, we must copy the two * commits that are both B and C, but copy them just once, and then copy the remaining three and two commits (respectively) onto the tip of the copied commits.

(It would be possible to write such a "smart rebase", but integrating that into Git properly, so that git status truly understands it, is considerably harder.)


Original answer

I'd love to see a reproducible example. In most cases your "in-head" model should work. There is one known special case though.

An interactive rebase, or adding -m or --merge to plain git rebase, actually does use git cherry-pick, while the default non-interactive rebase uses git format-patch and git am instead. The latter is not as good for rename detection. In particular, if there is a file rename in the upstream,[1] the interactive or --merge rebase can be expected to behave differently (usually, better).

(Also, note that both kinds of rebase—both the patch-oriented one and the cherry-pick based version—will skip commits that are git patch-id-identical to commits already in the upstream, via git rev-list --left-only --cherry-pick HEAD...<upstream> or equivalent. See the documentation for git rev-list, particularly the section on --cherry-mark and --left-right, which I think makes this more comprehensible. This should be the same for both kinds of rebase, though; if you are manually cherry-picking, it will be up to you whether you do this.)
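As a rough illustration of that filtering, the sketch below compares the plain range with the patch-ID-filtered one, using the symmetric-difference form with --right-only. The repository layout and the commit helper are assumptions for the example, and the copies are clean, so the patch IDs match.

```shell
# Sketch: patch-ID filtering drops D and E once D' and E' are on master.
# The commit() helper and the *.txt files are hypothetical.
set -e
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"

commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit A; commit B
git checkout -q -b next;       commit D; commit E
git checkout -q -b other-next; commit F; commit G
git checkout -q master;        commit C

git checkout -q next
git rebase -q master                # makes D' and E'
git checkout -q master
git merge -q --ff-only next

# Plain range: D, E, F, G.
plain=$(git rev-list --count master..other-next)

# Same side of the symmetric difference, minus patch-equivalent commits.
filtered=$(git rev-list --count --cherry-pick --right-only master...other-next)

# The rebase itself applies the same filter, so only F and G get copied.
git checkout -q other-next
git rebase -q master 2>/dev/null
copied=$(git rev-list --count master..other-next)

echo "plain: $plain, filtered: $filtered, copied by rebase: $copied"
```

If D' or E' had needed hand edits, the filtered count would stay higher, and the rebase would try to copy the originals again.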


[1] More precisely, git diff --find-renames needs to believe there is a rename there. Usually it believes this if there is one, but since it's detecting them by comparing trees, this is not perfect.

Upvotes: 12
