Reputation: 1

How come rebasing to a branch that has rebased introduces merge conflicts and seems to erase my current changes?

I am working with some others in a git repository, where a large change was made on branch a. The branch was basically finished and was waiting for approval, so when I started implementing my new feature I branched off it rather than master. I am working on branch b. There were some minor changes to branch a that I moved into branch b with a merge (though in retrospect I should have rebased).

After that, a change in branch c was merged into master. branch b was rebased to master.

I want to rebase to branch a to get these changes in my branch also, but when I do so I get several confusing conflicts in files I haven't touched. Where I have changed the files, what I see in the conflict does not reflect my changes. Instead, it seems as though I am resolving branch a and master, only branch a is labelled as HEAD and master as branch a. I'm confused.

How did this happen, and what can I best do?

If I merge again rather than rebasing I do see my own changes - though there are still conflicts in files I haven't changed, and the conflicts span many lines that haven't been touched on branch a since I branched off it. I can resolve this all right - but I was hoping not to repeat my earlier mistake.

I was expecting just to need to resolve the 3-4 lines that changed on branch c in the files I modified.

Upvotes: 0

Answers (1)

torek

Reputation: 490118

Fundamentally, these problems occur because (well, primarily because) rebase works by copying commits. The question is which commits get copied: what, where, when, and why, four of the five Ws. Who may also be interesting, and for the sixth factor, how ... well, we'll see that in a moment.

Fixing the mess is most likely a matter of copying just the right commits now, which you might be able to do with git rebase, or which may be easier with git cherry-pick.

Background

To understand this properly, you need to have a good grasp of precisely what a commit is, how commits get found, and how commits get copied. For a proper in-depth explanation of how commits get found, see Think Like (a) Git, but we'll touch on it lightly here.

What a commit is comes in two parts: data—a snapshot of your source—and metadata, such as your name and email address and other information about the commit:

The snapshot is just that: a complete copy of every source file, however it stands at the time you make the commit. The snapshot is actually made from the index, rather than from the source you can see, and this is why you keep having to git add files over and over again: git add somefile really means copy file somefile from the version I can see and work with, replacing whatever copy may be in the index now with this improved one. Running git commit snapshots the index copy.
The metadata includes the other stuff that git log can show: your name and email address (from your configured user.name and user.email); the date-and-time-stamp of when you made the commit; and most important to humans, but unimportant to Git, the reason you made the commit: the log message. But it also includes one more item—or sometimes, multiple items—that are crucially important to Git: the hash ID of each of the parent commits of this new commit.
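
To see both parts for yourself, here is a minimal sketch that builds a throwaway repository and prints a raw commit object. The temporary directory, the file name somefile, and the "Example User" identity are made up for illustration; nothing here touches a real repository.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false

echo "first" > somefile
git add somefile                 # copy the working-tree version into the index
git commit -q -m "first commit"  # snapshot the index

echo "second" > somefile
git add somefile                 # replace the index copy with the updated one
git commit -q -m "second commit"

# Show the raw commit object: a tree line (the snapshot), a parent line,
# author and committer lines with timestamps, then the log message.
commit_object=$(git cat-file -p HEAD)
echo "$commit_object"
```

The tree line is the snapshot and the parent line is the stored hash ID of the previous commit; the very first commit in a repository has no parent line at all.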

Every commit is uniquely identified by its hash ID. No two commits can ever share a hash ID. The hash ID appears to be completely random, but in fact it's completely non-random: it's a cryptographic checksum of the contents of the commit, including the source snapshot, your name and email, the date and time (the time helps make the commit unique even if you sneakily commit the same code twice, using the same name and email and log message), and (somehow, but this isn't hard after all) the hash ID(s) of the parent(s) as well.

Because hash IDs are cryptographic checksums, neither you nor anyone else—not even Git—can change anything about a commit once it's committed. If you take an existing commit, extract it, fiddle with it at all, and make a new commit, then either you've preserved every single bit of the original commit and hence haven't changed it and it's still just the original commit, or, you get a new and different commit with a completely different hash ID. What this means is that in a sense, the hash ID is the commit.
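
A quick way to convince yourself of this is to "change" a commit and watch the hash ID move. This is only a sketch in a scratch repository (the directory, file name, and identity are invented for the demonstration): it rewords a commit message with git commit --amend and compares the IDs before and after.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false

echo "content" > file.txt
git add file.txt
git commit -q -m "original message"
before=$(git rev-parse HEAD)
tree_before=$(git rev-parse 'HEAD^{tree}')

# Reword the message only. Git cannot edit the commit in place,
# so it writes a brand-new commit and moves the branch name to it.
git commit -q --amend -m "reworded message"
after=$(git rev-parse HEAD)
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "commit before: $before"
echo "commit after:  $after"
echo "tree before:   $tree_before"
echo "tree after:    $tree_after"
```

The snapshot did not change, so the two tree IDs are identical, but the commit IDs differ because the message is part of what gets hashed. The original commit is still in the repository; it just no longer has a name pointing at it.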

Meanwhile, each commit stores the hash ID of its parent(s). With any simple linear chain of commits, this means we can start with the last commit and use it to find each earlier commit. By representing commits as single uppercase letters instead of real hash IDs, we can draw this as:

... <-F <-G <-H

where H is the hash ID of the last commit in some branch. Moreover, this works for multiple branches as well: we just need to save, somewhere, the hash ID of each final commit:

...--F--G--H   <-- somehow remember H
         \
          I--J   <-- somehow remember J

This is where Git's branch names come in. What a branch name does is remember the hash ID of one commit. If master remembers H and dev remembers J, then we have some series of shared commits on both branches—everything through commit G is shared—and one commit private to master and two private to dev.
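
The picture above can be rebuilt in a scratch repository as a rough check. The commit helper below is a hypothetical convenience that creates one file per commit, and the letters are used as log messages so they line up with the drawing.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # force the branch name used in the drawing
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F; commit G
git branch dev            # dev and master both remember G for now
commit H                  # master now remembers H
git checkout -q dev
commit I; commit J        # dev now remembers J

git rev-parse master dev  # the one hash ID each name remembers

shared=$(git log -1 --format=%s "$(git merge-base master dev)")
only_master=$(git rev-list --count dev..master)
only_dev=$(git rev-list --count master..dev)
echo "last shared commit: $shared"
echo "private to master: $only_master, private to dev: $only_dev"
```

Each name resolves to exactly one hash ID; everything else about "what is on the branch" is worked out by following parent links backwards from there.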

Copying one commit with `git cherry-pick`

If a commit is a snapshot—and it is—how can git log or git show show it as a diff? How can you copy one commit to a new, slightly different commit? The answer lies in those same parent connections.

Suppose that commit J is actually an important bug fix, having nothing to do with the development going on on dev. We'd like to copy the fix in J into master to make a new commit K, giving us:

...--F--G--H--K   <-- master
         \
          I--J   <-- dev

What we need Git to do is to take the snapshots in I and J and compare them to each other. Whatever changed from I to J, that's the bug-fix we need to apply to H.

The git cherry-pick command does this. It does this using Git's internal merge machinery so that any other differences between I and H can be maintained correctly, but we'll ignore the details here: in simple cases, the merge has no complicated work to do, and we can imagine that Git just applies the I-vs-J changes to H to produce the copy of J that we'll call K.

That gives us just what we wanted: K and J do the same thing, so K is sort of a copy of J, but they do it to a different starting point, so K's parent is H (on master) and K itself can be safely kept only on master. Because K is a copy of J, Git will by default use the same author and log message and even the same time-stamp. You, and now, are the committer of new copy K, but whoever made J is the author. Of course, J and K have different hash IDs—they're different commits—but K is a copy of J.
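
Here is one way to try that out, as a sketch only. The scratch repository, the commit helper, the "Original Author" identity, and the fixed 2020 date are all invented so that the author and committer of the copy are easy to tell apart.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F; commit G
git branch dev
commit H
git checkout -q dev
commit I
# J is the bug fix, written by someone else some time ago.
echo "fix" > J.txt
git add J.txt
git commit -q -m J --author="Original Author <author@example.com>" \
    --date="2020-01-01 12:00:00 +0000"

git checkout -q master
git cherry-pick dev > /dev/null   # apply the I-vs-J change to H, producing K

j=$(git rev-parse dev)
k=$(git rev-parse master)
echo "J: $j"
echo "K: $k"
git log -1 --format='author:    %an, %ad%ncommitter: %cn, %cd' master
```

K ends up with a different hash ID and H as its parent, while the author name and author date are carried over from J. Only the fix comes along: J.txt appears on master, I.txt does not.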

Rebase is just an en-masse copy

Suppose we start instead with this:

...--F--G--H   <-- master
      \
       I--J--K   <-- dev

This time there are no important bug fixes. We'd just like to take the three commits that are on dev and make them start from H, rather than from F.

What we can do at this point is create a new dev2 branch, pointing to commit H, like this:

$ git checkout -b dev2 master

giving:

...--F--G--H   <-- master, dev2 (HEAD)
      \
       I--J--K   <-- dev

The special name HEAD tells us (and Git) which branch name gets updated as we go, so HEAD is attached to our new dev2.

Now we can, one at a time, copy commit I, then J, then K. Let's call the copies I', J', and K' this time:

             I'-J'-K'   <-- dev2 (HEAD)
            /
...--F--G--H   <-- master
      \
       I--J--K   <-- dev

Now let's delete the name dev entirely:

$ git branch -D dev

             I'-J'-K'   <-- dev2 (HEAD)
            /
...--F--G--H   <-- master
      \
       I--J--K   ???

and then rename dev2 to dev:

$ git branch -m dev2 dev

             I'-J'-K'   <-- dev (HEAD)
            /
...--F--G--H   <-- master

With no name by which to find them, original commits I-J-K seem to be gone. Since the name by which we find the three new commits is dev, it seems as though we have magically replaced the original three commits. We haven't—the original three are actually still in the repository, and will normally remain there for a while in case we change our minds and want them back. But normal Git commands won't see them. They're hidden away, so that we only see the new and improved dev branch.
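
The whole by-hand sequence can be replayed in a scratch repository. This is a sketch under the same assumptions as the drawings (one hypothetical file per commit, letters as log messages); it uses a cherry-pick range to copy I, J, and K in one command rather than three.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F
git branch dev
commit G; commit H
git checkout -q dev
commit I; commit J; commit K
old_tip=$(git rev-parse dev)             # remember the original K

git checkout -q -b dev2 master           # new name pointing at H
git cherry-pick master..dev > /dev/null  # copy I, J, K as I', J', K'
git branch -q -D dev                     # forget the old name
git branch -m dev2 dev                   # reuse the name for the copies

history=$(git log --format=%s dev | tr '\n' ' ')
echo "dev now reads: $history"
git cat-file -t "$old_tip"               # the original K still exists
```

Running git rebase master while on dev should leave you in the same end state in one step; the point of the long form is only to show that nothing more magical than copying and renaming is going on.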

This—copying commits, as if by git cherry-pick, and then re-shuffling branch names—is what git rebase does. But suppose, unbeknownst to whoever is doing this reshuffling of dev, you, in your Git repository, have made several more commits that depend on commit K? If you were to grab their stuff while they have a dev2, you would see this:

             I'-J'-K'   <-- origin/dev2
            /
...--F--G--H   <-- master, origin/master
      \
       I--J--K   <-- dev, origin/dev
              \
               L--M--N   <-- branch-b

If you wait until they've replaced their dev—and in general, you have to—then your own repository gets updated like this:

             I'-J'-K'   <-- origin/dev
            /
...--F--G--H   <-- master, origin/master
      \
       I--J--K   <-- dev
              \
               L--M--N   <-- branch-b

Now suppose you either delete your dev, or—more likely—update it to match theirs. What you see in your repository is:

             I'-J'-K'   <-- dev, origin/dev
            /
...--F--G--H   <-- master, origin/master
      \
       I--J--K--L--M--N   <-- branch-b

This means that you now have not three but six commits on your branch-b that they—the people running the origin/* names—don't have.

As far as Git is concerned, all six commits, I-J-K-L-M-N, are your commits.

If you rebase your branch-b onto your/their master right now, Git will try to copy all six commits. If you rebase onto their origin/dev, something more interesting happens, which we'll come back to in a moment.

Over time, they might merge their work into master, and/or you might merge their work into yours. Here's one potential graph, if they've merged their K' into their master (your origin/master and your now-updated master) and then you merged that O commit into your branch-b to produce merge P:

             I'--J'--K'   <-- dev, origin/dev
            /         \
...--F--G--H-----------O   <-- master, origin/master
      \                 \
       I--J--K--L--M--N--P   <-- branch-b

Whenever you rebase, your Git enumerates commits to copy

When you run git rebase target, your Git has to figure out which commits to copy. The list of commits to copy starts with all commits reachable from your current branch—hence, commits I-J-K-L-M-N-P but also F-G-H-...-O-P due to the fact that P is a merge commit. Here, again, I suggest working through Think Like (a) Git.

From this big list of all reachable commits, Git immediately subtracts all commits reachable from the target. So if you chose, say, dev as your target, Git would eliminate commits F-G-H-I'-J'-K'. If you chose master, Git would eliminate F-G-H-I'-J'-K'-O. These are obvious to eliminate from the list because they're already there in the target.

Git also omits any merge commits, just automatically. So that knocks out O and P. Merge commits literally cannot be copied: copying a commit requires comparing it to its (single) parent, and merge commits have two parents. There's no way for Git to know which parent to use, so it just doesn't copy them.
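
You can approximate these first two filtering steps yourself with git log or git rev-list. The sketch below rebuilds the graph with merges O and P in a scratch repository (one hypothetical file per commit, letters as log messages) and then lists what is reachable from branch-b but not from master, with merges dropped. It does not reproduce everything rebase does internally; the patch-ID step described next is deliberately left out.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F
git checkout -q -b dev;      commit I; commit J; commit K
git checkout -q -b branch-b; commit L; commit M; commit N
git checkout -q master;      commit G; commit H
git checkout -q dev
git rebase -q master               # "their" rebase: I-J-K become I'-J'-K'
git checkout -q master
git merge -q --no-ff -m O dev      # they merge K' into master as O
git checkout -q branch-b
git merge -q -m P master           # you merge O into branch-b as P

# Reachable from branch-b, minus reachable from master, minus merges.
candidates=$(git log --no-merges --format=%s master..branch-b | sort | tr '\n' ' ')
merges=$(git rev-list --count --merges master..branch-b)
echo "candidates to copy: $candidates"
echo "merges dropped:     $merges"
```

The candidate list still contains the original I, J, and K, which is exactly the leftover that the patch-ID comparison has to deal with.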

But that still leaves I-J-K-L-M-N as the list of commits to (maybe) copy. This is where Git's idea of how to identify an already-copied commit comes in.

For each commit that's in the upstream "already there, don't copy" list—which includes I'-J'-K'—Git computes a patch ID, using the git patch-id program. This essentially reduces each commit to an approximation of the change from the parent of that commit, to that commit. That is, Git finds the same set of changes that git cherry-pick would copy, and computes a hash ID from those changes.

Then, Git computes patch-IDs for all of your commits I-J-K-L-M-N. If any of those patch-IDs match the patch-IDs for the upstream commits, Git knocks those commits out. In many cases, this completely rescues your rebase and everything just works. Your original I-J-K commits—well, yours now, even though they weren't yours originally—are left over in your branch because they copied theirs to I'-J'-K' and then abandoned their I-J-K, but your branch-b retained them even though they thought they had left I-J-K behind forever. But if your I-J-K have patch-IDs that match their I'-J'-K', your rebase will work well.

If their patch-IDs don't match, though, now you have a problem. Git can't automatically exclude the copied commits. The nearly-but-not-exactly-the-same commits will clash when Git goes to apply them at their new place for the rebase. These will be files you never touched, but whose changes you inherited in the preserved commits that they—whoever "they" are—thought they had successfully abandoned. You're bringing back their work as your own, even though it's not something you wanted.
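
To check whether your inherited commits still match their copies, you can compute patch IDs by hand. This sketch covers the lucky case in a scratch repository (hypothetical files, letters as log messages): it compares the patch ID of the original K with that of the rebased K', then asks git rev-list to drop patch-equivalent commits using its --cherry-pick and --right-only options.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F
git checkout -q -b dev;      commit I; commit J; commit K
git checkout -q -b branch-b; commit L; commit M; commit N
git checkout -q master;      commit G; commit H
old_k=$(git rev-parse branch-b~3)  # original K, still reachable from branch-b
git checkout -q dev
git rebase -q master               # "their" rebase
new_k=$(git rev-parse dev)         # the copy K'

# Patch ID: a hash of the change a commit makes relative to its parent.
patch_id() { git show "$1" | git patch-id | cut -d' ' -f1; }
id_old=$(patch_id "$old_k")
id_new=$(patch_id "$new_k")
echo "K  patch-id: $id_old"
echo "K' patch-id: $id_new"

# Commits on branch-b with no patch-equivalent commit on dev.
remaining=$(git rev-list --no-merges --cherry-pick --right-only dev...branch-b | wc -l | tr -d ' ')
echo "commits that would really be copied: $remaining"
```

Here the copies are exact, so only L, M, and N remain. If whoever rebased also edited a commit or resolved a conflict along the way, the patch IDs for that commit would differ and it would stay in the list.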

What you'll need to do to recover

The way to recover from all of this is to figure out which commits should get copied. If you're lucky, they may all be in a neat simple row. In this case you can use git rebase --onto to split the job into two parts. Normally you run:

git rebase <target>

and target specifies both the place to put the copies, and the set of commits to avoid copying. But you can run:

git rebase --onto <target> <dontcopy>

and now target only specifies where to put the copies; the dontcopy argument tells Git what not to copy. Those same reachability rules from Think Like (a) Git determine which commits will and won't be put in the "maybe copy" list. Rebase will then throw out all merges and all same-patch-ID commits too, and copy whatever remains.
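
As a sketch of the lucky case, the script below sets up the stranded situation in a scratch repository and makes it unlucky on purpose: after their rebase, K' is amended so its patch ID no longer matches the original K. The files, the "K improved" content, and the commit helper are all hypothetical. It then uses the original K as the dontcopy argument.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F
git checkout -q -b dev;      commit I; commit J; commit K
git checkout -q -b branch-b; commit L; commit M; commit N
git checkout -q master;      commit G; commit H
old_k=$(git rev-parse branch-b~3)   # last commit that is NOT yours

git checkout -q dev
git rebase -q master                # their rebase: I'-J'-K'
echo "K improved" > K.txt
git commit -q -a --amend -m K       # they also changed K' along the way

# Put the copies on top of dev, but copy only what comes after the old K.
git rebase -q --onto dev "$old_k" branch-b

yours=$(git log --format=%s dev..branch-b | tr '\n' ' ')
echo "commits on branch-b beyond dev: $yours"
cat K.txt
```

Only L, M, and N are copied, and K.txt holds their amended version. A plain git rebase dev in the same situation would most likely try to re-apply the original K as well and stop with a conflict in a file you never touched.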

If you're not lucky, the set of commits to copy will be sprinkled about, hither and yon. You'll have to create a new temporary branch, or use "detached HEAD" mode, and run a series of git cherry-pick commands to copy the commits that should get copied, the ones that are truly yours.
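
For the unlucky case, here is a minimal sketch of the temporary-branch approach. It assumes, purely for illustration, that L and N are truly yours while M is not; the branch name rescue, the files, and the commit helper are likewise made up.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F
git checkout -q -b dev;      commit I; commit J; commit K
git checkout -q -b branch-b; commit L; commit M; commit N
git checkout -q master;      commit G; commit H
git checkout -q dev
git rebase -q master                    # their rebase: I'-J'-K'

l=$(git rev-parse branch-b~2)           # L, one of yours
n=$(git rev-parse branch-b)             # N, also yours

git checkout -q -b rescue dev           # temporary branch starting at K'
git cherry-pick "$l" "$n" > /dev/null   # copy only the commits that are yours

picked=$(git log --format=%s dev..rescue | tr '\n' ' ')
echo "copied onto rescue: $picked"
```

Once the temporary branch looks right, it can take over from the old branch name in the same delete-and-rename fashion shown earlier for dev2.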

Merging avoids this problem because merging doesn't copy commits

When people use git merge to combine work, that leaves the original commits undisturbed. That way, if someone else (like you!) is using those original commits in some way, Git knows those commits are properly incorporated, because each merge records both parents of the merge operation. The graph (the connecting lines going from later commits back to earlier ones) shows the true history of commits and development, and Git can figure everything out.

When people use git rebase to combine work, they discard their original commits in favor of new and (supposedly) improved copies of the original commits. If everyone else knows to deal with this problem right away, everyone else who's sharing work with this repository can handle this before it becomes a huge mess. Or, if no one else has ever seen these commits, you—the person doing this git rebase right now—can be sure that no one else is using your originals. When you abandon them in favor of new and improved commits, you are not stranding someone else who's using your commits, because no one else has your commits.

But in this case, you've been stranded: you were using someone else's commits, then they ripped them out (in favor of new-and-improved rebased commits), and they thought everyone was done with them. But you were and are still using them, and now Git thinks these are your commits and is dragging them around with you wherever you go. After all this time, you have to rid yourself of these commits, and the only way you can do that is by selective copying.

Upvotes: 1