tarekahf

Reputation: 1002

How to rebase a feature branch with multiple merge commits due to git pull to be able to squash them

On several occasions I have run git pull on a feature branch and ended up with multiple annoying "merge commits". I understand why they happen, but I want to consolidate them so they appear as normal commits.

I tried to use git rebase -i --rebase-merges HEAD~4 but could not figure out how to squash the merge commits.

After further research and a lot of digging, I was able to do the following to consolidate unwanted merge commits into normal commits using rebase, and then squash them if needed:

git checkout feature
git pull  # will create merge commits
git checkout -b feature_backup  # to create a backup
git switch --orphan empty_commit
git commit -m "First empty commit for the feature branch" --allow-empty
git switch feature
git rebase empty_commit
git rebase -i --root  # this allows you to squash commits
git branch -D empty_commit

Is there a better way to consolidate the merge commits?


Upvotes: 1

Views: 5836

Answers (2)

torek

Reputation: 488203

Ôrel's answer has a recipe. You might not want origin/master directly—what you do want is up to you and you may wish to think about this and experiment, once you've read through and are working out the bits of this explanation—but rebase does work.

I still don't understand how the git rebase you mentioned will work? My problem is that when I do a git pull on a feature branch, it'll end up having merge commit so how the git rebase origin/master will work?

The key here is to understand a number of things simultaneously. (This is often the case with Git.) You need to know:

  • how individual commits work;
  • how branch names work;
  • how git fetch works;
  • how git merge works;
  • that git pull means run git fetch, then run git merge or some other second command (you have been using git merge); and
  • how git rebase works.

This is a lot of stuff! We cannot hope to cover it all, not even in one of my famously[1] long answers, so we'll race through a few key items.


[1] Or other adverb of your choice.


Commits, branches, and branch names

A commit:

  • is numbered: it has a unique hash ID. That number means that commit, not only in your repository, but in every repository, even all the Git repositories that don't have your commit. (This is the main deep magic that powers Git.)

  • is read-only: no part of any commit can ever be changed. This is required by the magic numbering system.

  • contains two things: a full snapshot of all files (stored indirectly, in a special Git-ized form where they're compressed and de-duplicated, so it is a good thing when a new commit re-uses almost all the files from a previous commit: that means they take no space), plus metadata: information about the commit.

The metadata contains stuff like your name and email address and the date-and-time at which you made the commit. For Git's own purposes, Git stores, in the metadata in any one commit, a list of previous commit hash IDs. Most commits—"ordinary" commits, we tend to call them—have exactly one hash ID here. This forms the ordinary commits into a simple chain, except that the arrows connecting commits are backwards instead of forwards. Humans like to think of the arrows as going forwards, but that can't work in Git, because commits are strictly read-only.

What this means is that given a string of commits in a row, each with its own hash ID, we can draw that string of commits, newer commits towards the right, like this:

... <-F <-G <-H

Here H stands for the hash ID of the last commit in the chain. Commit H contains a full snapshot of every file, plus some metadata. In H's metadata, Git has stored the hash ID of some earlier commit G, which we (and Git) call the parent of commit H. So commit H points to earlier commit G.

Of course, G is also a commit, so it has a snapshot and metadata, and that metadata holds the hash ID of some still-earlier parent F. But F is a commit too, so F points backwards as well. This goes on forever, or rather, until we get to the very first commit ever, which, being the first, has no parent (a weird sort of "virgin birth" as it were; Git calls it a root commit).
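The backwards-pointing chain can be seen directly in a scratch repository. This is only a sketch: the temporary directory, the placeholder identity, and the commit helper are assumptions made for the illustration, not anything from the question.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" whatever the local default is
git config user.name "Example User"      # placeholder identity, local to this scratch repository
git config user.email "user@example.com"

# Hypothetical helper: make one commit and print its hash ID.
commit() {
    echo "$1" > file.txt
    git add file.txt
    git commit -q -m "$1"
    git rev-parse HEAD
}

F=$(commit F)    # the first commit made here: the root commit
G=$(commit G)
H=$(commit H)

# The raw commit object: a tree line (the snapshot), a parent line,
# then the author and committer metadata and the message.
git cat-file -p "$H"

parent_of_H=$(git cat-file -p "$H" | sed -n 's/^parent //p')   # the hash ID stored inside H
root=$(git rev-list --max-parents=0 HEAD)                      # commits that have no parent
echo "H points back to $parent_of_H; the chain ends at root $root"
```

The parent line printed for H is G's hash ID, and the root commit F has no parent line at all.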

The commits are all stored in a big database (of "all Git objects", including commit objects—the other kinds of objects are mostly supporting things for commits, namely tree and blob objects, plus annotated tag objects). This database is a simple key-value store where the keys are the hash IDs. So Git needs a hash ID to find a commit.

All of this has two important implications:

  • Commits don't store diffs. We get diffs—we see a commit as a change—by having Git compare adjacent commits. We pick some parent/child pair, such as G and H, and have Git compare the two snapshots. The de-duplication trick that Git uses makes it easy for Git to throw out all exactly-the-same files right away, so Git only has to figure out what changed in two files that are actually different. This usually does not take too long, so git show or git log -p can show a "patch" even though Git has stored only snapshots.

  • Git can find every commit on its own except the last one, using the parents stored in each commit. We have to tell Git the hash ID of commit H. From there, Git works backwards all on its own. But we have to provide a raw hash ID here, and that's horrible for humans: the hash IDs look random and there's no way to figure one out. You'd have to memorize them, or write them down, or something.

To handle that last problem, Git provides, as a separate key-value database, one that's keyed by names: branch names, tag names, and many other kinds of names. In this database, the value associated with any given name is one hash ID. You get just one hash ID—not two or three or many, just one—but that's all we need, because we only need to store the hash ID of the latest commit, e.g., commit H.

Since we say that commit H "points to" its parent G, we likewise say that a branch name points to the last commit in the branch. Git's term for this is that H is the tip commit, and we can add it to our drawing like this:

...--G--H   <-- main     # or master or whatever

When we're "on" some branch, we attach a special name—HEAD—to that branch name. (Git literally just stores the branch's name in a file in .git named HEAD, at least for the main working tree, though you're not supposed to depend on this in case a future version of Git comes up with a better / fancier way to do it.) This means that if we have multiple branch names, all pointing to commit H—which is a perfectly normal thing to do in Git—then as we add commits, only the HEAD branch-name gets updated. We might start with, e.g.:

...--G--H   <-- feature (HEAD), main, zorg

and then make one new commit—it gets some new, unique hash ID, but we'll just call it "commit I"—and that git commit command makes Git store the new hash ID in the current branch name, so that we get:

...--G--H   <-- main, zorg
         \
          I   <-- feature (HEAD)

The special name HEAD remains attached to the current branch name. The name feature now selects I as its tip commit; I points backwards to H, because H was the tip of feature when we made I; and H and G and so on are all unchanged (they must be, because they are all read-only). The next commit updates the current branch name yet again:

...--G--H   <-- main, zorg
         \
          I--J   <-- feature (HEAD)

New commit J points back to I, which points back to H, and so on. I cannot change once we've made it: no part of any commit can ever change. Note that, while some people will refer to commits I-J as "what's on feature", in fact, all the commits are on feature: it's just that commits up through and including H are also on other branches, while I-J are only on feature at the moment.
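Here is a small sketch of that behavior, using the same names as the drawings (main, zorg, feature). The scratch repository and placeholder identity are assumptions for the illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo H > file.txt
git add file.txt
git commit -q -m H
git branch zorg                 # a second name for the same commit H
git checkout -q -b feature      # a third name, and HEAD is now attached to it
H=$(git rev-parse HEAD)

echo I > file.txt
git commit -q -am I             # new commit I: only feature moves
echo J > file.txt
git commit -q -am J             # new commit J: only feature moves again

current=$(git symbolic-ref --short HEAD)      # the branch name HEAD is attached to
echo "HEAD is attached to: $current"
git log --oneline --graph --decorate --all    # shows which names select which commits
```

After the two commits, main and zorg still select H, while feature selects J, two steps ahead of H.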

git fetch, or, commits are universal but branch names are not

When we clone a repository, we copy all of its commits[2] and none of its branch names. Instead of copying its branch names, we take each of their (the other repository's) branch names and change it into a remote-tracking name: their main becomes our origin/main, and their feature becomes our origin/feature, for instance. Then our Git creates one branch name in our new repository, which at that point contains all of their commits plus these modified names. The new branch name will match one of their branch names and will select the same tip commit as their name, so our picture might look like this:

...--G--H   <-- main (HEAD), origin/main
         \
          I--J   <-- origin/feature

depending on what they had in their Git repository when we ran git clone to make our Git repository.

We can now create our own feature name as well, and switch to it:

...--G--H   <-- main, origin/main
         \
          I--J   <-- feature (HEAD), origin/feature

and if and when we make new commits, they add on to our feature. Our memory of their feature still points to commit J though:

...--G--H   <-- main, origin/main
         \
          I--J   <-- origin/feature
              \
               K--L   <-- feature (HEAD)

These look exactly like branches, because they are exactly like branches, depending on what we mean by branch (see also What exactly do we mean by "branch"?). To the extent that "branch" means "set of commits found by starting from some name and working backwards", a remote-tracking name works just fine as a branch. (But it's also not a branch because you cannot git switch to it. So is it a branch? That depends on what you mean by branch. This problem with the word branch is why you must be careful when saying "branch"—you may know what you mean, but will someone else? Are you even sure you know what you mean? 😀 Human cognition being as loose as it is, sloppy wording might result in bad conclusions.)

In any case, your remote-tracking names are reflections of their branch names, as of the last time your Git software called up their Git repository and asked them about those branch names. But if it's been a few months, or days, or hours, or perhaps even seconds, maybe their repository has changed. To resync, you run git fetch:

  • Your Git software calls up their Git software.
  • They list out their branch names and commit hash IDs.
  • Your Git—your software working on your repository—checks to see if you have those commits (i.e., if you have those hash IDs). For ones you are missing, your Git brings those over, along with all needed parents and grandparents and so on. Now you have all your own commits, plus any new ones they had that you didn't.
  • Last, your Git updates your remote-tracking names with the new memories. (Note that this step can be suppressed, and ancient versions of Git didn't do it as well as modern Git.)

So, after git fetch, if they provided new commits, you might have:

...--G--H   <-- main, origin/main
         \
          I--J----N--O   <-- origin/feature
              \
               K--L   <-- feature (HEAD)

Your feature is now behind their feature, as remembered by your origin/feature, in that commits N and O are not on your feature. Your feature is also ahead of their feature by two commits: your K-L. But the "behind-ness" can be an issue.
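The ahead/behind situation can be reproduced with two scratch repositories. This is a simplified sketch: "theirs" has only a feature branch (no main), and the directory names, placeholder identity, and commit helper are all assumptions for the illustration.

```shell
set -e
work=$(mktemp -d) && cd "$work"
# Placeholder identity for every scratch repository below.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

# "Their" repository, with a feature branch ending at J.
git init -q theirs
cd theirs
git symbolic-ref HEAD refs/heads/feature
commit I
commit J
cd ..

# Our clone: their feature becomes our origin/feature, and we get our own feature.
git clone -q theirs ours
cd ours
commit K
commit L          # our work, not known to them

# Meanwhile they add N and O.
cd ../theirs
commit N
commit O

# Resync: bring over the new commits and update origin/feature.
cd ../ours
git fetch -q origin
ahead=$(git rev-list --count origin/feature..feature)    # commits only we have
behind=$(git rev-list --count feature..origin/feature)   # commits only they have
echo "feature is ahead by $ahead and behind by $behind"
```

Note that git fetch only updated origin/feature here. Our own feature name did not move, which is why a second command (merge or rebase) is still needed.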


[2] Well, all or mostly-all: we won't get into fine distinctions here. Just note that this is a bit of an over-generalization.


How git merge works

Let's drop back a bit and just think about your own individual repository. Suppose you have two branch names, br1 and br2, that are arranged like this:

          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2

You'd like to use git merge to tie these two branches together. You pick either one and switch to it, e.g., git switch br1 or git checkout br1:

          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

and then you run git merge br2. The end result of this is:

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2

after which you're free to delete the name br2 because its tip commit L is now find-able from commit M. You don't have to delete the name, but you can: the point of branch names is to be able to find some tip commit, and that's only interesting if you plan to use that commit again, or add on to it, or both.

To make commit M, Git has to combine work. This combining-work trick requires using git diff twice, and making that work requires finding some suitable commit that's on both branches. In this particular case, it's easy to see that commit H is on both br1 and br2 and is the most suitable commit: Git calls this the merge base of the two branch tip commits. So Git now runs:

git diff --find-renames <hash-of-H> <hash-of-J>   # what we changed on br1
git diff --find-renames <hash-of-H> <hash-of-L>   # what they changed on br2

The merge code then combines the two sets of changes, applies the combined changes to commit H—this adds their changes to our changes, or adds our changes to their changes, depending on how you want to look at it—and if all goes well, git merge goes on to make merge commit M.

What's special about M is that it has two parents. That's the only thing special about M! It is like any other commit in every other way: it has a snapshot of all files, and it has metadata. The snapshot contains the files from the merge base as modified by the combined changes. The metadata contains your name and email address as usual, and the current date-and-time as usual, and—this is the only special part—a list of two parent hash IDs, instead of just one.
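As a sketch of the above, we can build the br1/br2 picture in a scratch repository, ask Git for the merge base, and count the parents of the resulting merge. The repository, identity, and commit helper are assumptions for the illustration. The two git diff commands only approximate what the merge code does internally.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
H=$(git rev-parse HEAD)
git checkout -q -b br1
commit I
commit J
git checkout -q -b br2 "$H"
commit K
commit L
git checkout -q br1

base=$(git merge-base br1 br2)                # the best commit that is on both branches
git diff --find-renames --stat "$base" br1    # what we changed on br1
git diff --find-renames --stat "$base" br2    # what they changed on br2

git merge -q --no-edit br2                    # combine both sets of changes into M
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
echo "merge base: $base; M has $parent_count parents"
```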

This makes M find both parent commits, which is why we can delete br2. But we do not have to delete br2. We can go on and make more commits on br2 if we like:

          I--J
         /    \
...--G--H      M   <-- br1
         \    /
          K--L---N--O   <-- br2 (HEAD)

If we now switch back to br1 and run git merge br2 again, we'll get another merge commit:

          I--J
         /    \
...--G--H      M------P   <-- br1 (HEAD)
         \    /      /
          K--L---N--O   <-- br2

The merge base for the two diffs will be commit L this time (this is less obvious) and the two tip commits will be M and O and that's the source for the files that go into new merge commit P. The end result is that Git doesn't have to consider the changes in I-J or K again separately, nor start from H: it gets to start from L instead. More repeated work and merges result in yet more of the same:

  ...--M-----P-----T   <-- br1 (HEAD)
      /     /     /
...--L--N--O--R--S   <-- br2

The merge base for making T was O, with the two tip commits being P and S.

This is git merge in a nutshell: we combine changes since some common starting point, found by inspecting the connections from commit-to-commit, and make a new merge commit with two parents, so that if there is a next merge, that next merge can start where we left off.
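The way the merge base moves forward after each merge can be checked with git merge-base. Again this is a sketch in a scratch repository, with an assumed placeholder identity and a hypothetical commit helper.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b br1
commit I
commit J
git checkout -q -b br2 main
commit K
commit L
L=$(git rev-parse br2)

git checkout -q br1
git merge -q --no-edit br2                 # first merge: M, with merge base H

git checkout -q br2
commit N
commit O
O=$(git rev-parse br2)

git checkout -q br1
second_base=$(git merge-base br1 br2)      # where the second merge starts from
git merge -q --no-edit br2                 # second merge: P
after_P=$(git merge-base br1 br2)          # where a third merge would start from
echo "second merge base: $second_base; after P: $after_P"
```

The second merge starts from L rather than H, and once P exists, any further merge would start from O.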

git pull

Once you properly understand commits, git fetch, and git merge, the default git pull—which basically just runs git fetch and then git merge in that order—is simple. Of course git pull wouldn't be what it is without a lot of extra options and features, but if you have not been using them, we can stop here.

I always recommend that Git newbies avoid git pull because:

  • you have too many choices of second command, and the default-default (merge) is not always suitable;
  • the second command (merge vs rebase) makes a huge difference; and
  • the second command often "fails".

Fails is too strong, which is why I put it in quotes, but both git merge and git rebase can stop in the middle. Merge is much simpler than rebase (as we'll see in a moment), so it might well be the better default, but either way you must know how to clean up after a failure, or you're in trouble—and, unfortunately (and for bad reasons), the methods for cleaning up differ, so you need to know which one you're using. If you're new to Git, you won't even know which one you're running: your default depends on your configuration, and if you're working with a team (which is common) they may have asked you to set your default to use git rebase.

If you run git fetch, and then, separately, run your second command, you'll have a better idea what you're doing, and hence a better idea about how to get help. It's not really any easier—I had a colleague once who referred to this as pushing the "hard" around—and yet ... it's somehow easier. (Perhaps it's just more informative, but it does seem to help.)

git rebase

We often find, when using Git, that some set of commits we've made so far are ... well, they are okay, but not really what we want. We have a big problem here though: commits are literally impossible to change. Once made, they're set in stone.

But: what if we could copy some set of commits to some new-and-improved commits? Let's say we have:

...--G--H   <-- main
         \
          I--J--K   <-- feature (HEAD)

We're sort of done with our feature but we've just noticed that there's a bug in I, so we make one more commit to fix it:

...--G--H   <-- main
         \
          I--J--K--L   <-- feature (HEAD)

Now, the only reason we have L at all is to fix the bug in I. It sure would be nice if we hadn't put the bug into I in the first place. And we can do that. We can't change commit I, but we can make a new combined commit out of I and L—let's call it "commit IL"—that comes after H:

          IL   <-- temporary-branch (HEAD)
         /
...--G--H   <-- main
         \
          I--J--K--L   <-- feature

We do this by creating a new branch name, temporary-branch, that points to commit H (same as main), then making our new commit by whatever means (we'll come back to the "by whatever means" in a moment). Then we make another new commit J, that's a duplicate of what J did but as applied to IL. We'll call this new commit J' since it's so similar to J:

          IL-J'  <-- temporary-branch (HEAD)
         /
...--G--H   <-- main
         \
          I--J--K--L   <-- feature

Last, we copy K to a new K':

          IL-J'-K'  <-- temporary-branch (HEAD)
         /
...--G--H   <-- main
         \
          I--J--K--L   <-- feature

Our temporary branch now holds the three commits we wish we had made.

Now, hang on a moment: a while ago, we observed that we find commits by using the branch name to find the tip commit. Each branch name holds exactly one hash ID. What happens if we force the name feature to select commit K' instead of commit L? We'll get this:

          IL-J'-K'  <-- feature, temporary-branch (HEAD)
         /
...--G--H   <-- main
         \
          I--J--K--L   [abandoned]

Commits I-J-K-L will still exist, but without a name by which to find them, we won't ever see them. We can now attach HEAD to feature again and delete the temporary branch name entirely:

          IL-J'-K'  <-- feature (HEAD)
         /
...--G--H   <-- main
         \
          I--J--K--L   [abandoned]

It looks like we did everything right, right from the start.
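The temporary-branch procedure just described can be carried out by hand. The sketch below uses git cherry-pick -n (apply without committing) as one possible "whatever means" for building IL; it is not necessarily what git rebase does internally, and the file names, identity, and scratch repository are assumptions for the illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "H"
git checkout -q -b feature
echo buggy > i.txt; git add i.txt; git commit -q -m "I (has a bug)"
echo j > j.txt;     git add j.txt; git commit -q -m "J"
echo k > k.txt;     git add k.txt; git commit -q -m "K"
echo fixed > i.txt; git commit -q -am "L (fixes the bug in I)"
I=$(git rev-parse feature~3); J=$(git rev-parse feature~2)
K=$(git rev-parse feature~1); L=$(git rev-parse feature)

git checkout -q -b temporary-branch main   # new name pointing to H
git cherry-pick -n "$I"                    # apply I without committing
git cherry-pick -n "$L"                    # apply L on top, still without committing
git commit -q -m "IL"                      # the combined commit IL
git cherry-pick "$J" "$K" >/dev/null       # the copies J' and K'

git branch -f feature temporary-branch     # force feature to select K'
git checkout -q feature                    # re-attach HEAD to feature
git branch -q -D temporary-branch          # delete the temporary name
git log --oneline main..feature
```

The originals I-J-K-L are still in the repository (their hash IDs remain valid for now), but feature no longer finds them.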

Let's consider another scenario, where we make our I-J-K commits perfectly, but someone else has come along and added a new L to what we'd like as our main. We get our own main updated:

...--G--H--L   <-- main
         \
          I--J--K   <-- feature (HEAD)

and now the one thing we don't like about I-J-K is that they come after H: if they just came after L, they'd be perfect. We can use the exact same trick, of making a temporary branch, copying the three commits that we do like, and making the name feature point to the final tip commit:

             I'-J'-K'  <-- feature (HEAD)
            /
...--G--H--L   <-- main
         \
          I--J--K   [abandoned]

The abandoned commits vanish from view (though they still exist in the repository) and it seems like we started our work after L existed.

In both cases, what we're doing is relatively simple, though it has a lot of consequences: we're copying—and making some change or changes as we go, before they're set into stone—some existing commits to some sort of new-and-improved commits. Git provides one major "power tool" command for doing this, namely git rebase.

The most basic form of git rebase works by making no changes to what each commit changes. We use this one for the simple I-J-K-on-H becomes I'-J'-K'-on-L operation, for example. Each "copy one commit" step is done with a simple git cherry-pick command.[3]

But there's a problem: cherry-pick literally can't copy a merge commit. If you try it, you get an error telling you that you must supply the -m option. This option tells git cherry-pick to pretend that the commit is an ordinary single-parent commit, instead of a merge.[4] The original git rebase solution to this problem is to ignore merge commits entirely.

That is, if we have:

          I--J--M   <-- feature (HEAD)
         /     /
...--G--H--K--L   <-- main

and we run git rebase main, git rebase ignores commit M entirely. It copies just commits I-J, putting the copies after L:

          I--J--M   [abandoned]
         /     /
...--G--H--K--L   <-- main
               \
                I'-J'  <-- feature (HEAD)

It turns out that this is often what we want anyway.
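This is the case from the question, so it is worth trying in a scratch repository. In the sketch below, the merge M is made with a plain git merge main, standing in for what a merging git pull would have produced; the identity and the commit helper are assumptions for the illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b feature
commit I
commit J
git checkout -q main
commit K
commit L
git checkout -q feature
git merge -q --no-edit main          # merge commit M on feature

merges_before=$(git rev-list --merges --count main..feature)
git rebase -q main                   # copies I and J after L, drops M
merges_after=$(git rev-list --merges --count main..feature)
copied=$(git rev-list --count main..feature)
echo "merge commits before: $merges_before, after: $merges_after; commits on top of main: $copied"
```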

When it's not what we want, then we have the new-in-Git-2.18 git rebase --rebase-merges option. This doesn't try to copy the merges at all. Instead, it copies the commits leading up to the merge, then runs git merge anew to make a new merge commit, then resumes copying more commits if needed. But since we want to drop merges, we get to ignore this newfangled option.

We also have git rebase --interactive, which offers the ability to do squash and fixup operations, and to move commits around. Each pick command in an instruction sheet from git rebase --interactive corresponds to an individual git cherry-pick command. Changing one to squash means that instead of doing a normal cherry-pick-and-commit, the previous operation should do a cherry-pick but not commit yet, and then this commit should be added in, and then we'll commit.
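To see the instruction sheet at work without opening an editor, the sketch below sets GIT_SEQUENCE_EDITOR to a sed command that rewrites the sheet. That sed command is only a stand-in for editing the sheet by hand, and it uses fixup (like squash, but keeping only the first commit's message) so that no message editor is needed. The scratch repository, identity, and commit helper are assumptions for the illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b feature
commit I
commit J
commit K
tree_before=$(git rev-parse 'feature^{tree}')   # the snapshot at the tip, before rebasing

# Keep "pick" on line 1 of the sheet, turn the "pick" on lines 2 and 3 into "fixup".
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,3s/^pick/fixup/'" git rebase -q -i main

count=$(git rev-list --count main..feature)     # commits left on top of main
subject=$(git log -1 --format=%s)               # message of the combined commit
tree_after=$(git rev-parse 'feature^{tree}')
echo "$count commit(s) on top of main, subject: $subject"
```

The three commits become one, and the final snapshot is unchanged: only the history leading up to it differs.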

All of these fancy options are just that: fancy options that add on to the basic idea, that we're going to copy some commits. Fundamentally, then, git rebase needs to know two things:

  • What commits would you like to copy (and then abandon the original versions)?
  • Where would you like to place these copies?

The git rebase command by default cleverly combines both questions into one. We start with the observation that when we run git merge, we're always merging into the current branch. Similarly, when we run git rebase, we're always copying commits from the current branch.[5] The problem with just saying "well, we want the commits that are on the current branch" is that there are probably hundreds or even thousands of commits on the current branch. The current branch goes all the way back to the very first commit! We don't want to copy all those commits. We want to copy a selected subset of those commits.

In the example above, for instance, the selected subset of commits we want to copy is I-J-M, except that M is a merge, so we don't want to copy it after all. That leaves I-J as the commits to copy.

Git has a funky syntax, X..Y, that means all commits reachable from Y but not reachable from X. This "reachable from" notion is another thing you should know about Git: see, e.g., Think Like (a) Git. In our case, main..feature produces exactly the right list of commits: I, J, and then M.[6]
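The X..Y selection can be inspected with git rev-list or git log. The sketch below rebuilds the I-J-M picture in a scratch repository (identity and commit helper are assumptions for the illustration) and lists main..feature both with and without the merge.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b feature
commit I
commit J
git checkout -q main
commit K
commit L
git checkout -q feature
git merge -q --no-edit main                       # merge commit M

all=$(git rev-list --count main..feature)         # reachable from feature, not from main
git log --oneline main..feature                   # Git's own order: newest first

# Roughly the list a plain rebase would copy: no merges, oldest first.
to_copy=$(echo $(git log --reverse --topo-order --no-merges --format=%s main..feature))
echo "main..feature has $all commits; to copy: $to_copy"
```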

Since we're always going to copy commits from the current branch, git rebase doesn't need you to write down the name feature. It just needs you to type in the name main, so that it can construct main..feature itself and get the list of commits to copy. And—this is the clever part—the name main selects the right place to put the copies too! We want I'-J' to go after commit L, and the name main selects commit L.

So we just run:

git rebase main

or perhaps git rebase -i main to get the interactive variety, and Git knows:

  • which commits to copy: main..feature, and
  • where to point the copies: after the commit named by main

and that's all we need!

Occasionally this doesn't quite work. For these cases, git rebase has --onto. We run:

git rebase --onto <place> <what-not-to-copy>

and the listing of commits to copy uses <what-not-to-copy>..HEAD to figure out which commits to copy. That is, the argument we give without --onto supplies the X part of X..Y, while the --onto argument says where to put the copies, and that's how we run git rebase.
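As a sketch of --onto, suppose feature holds I-J-K on top of H, main has gained L, and we want only J and K copied after L, leaving I behind. The scenario, identity, and commit helper are assumptions for the illustration; J and K touch different files from I, so the copies apply cleanly.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b feature
commit I
commit J
commit K
git checkout -q main
commit L
git checkout -q feature

# <place> is main (commit L); <what-not-to-copy> is feature~2 (commit I).
# The commits to copy are therefore feature~2..HEAD, that is, J and K.
git rebase -q --onto main feature~2

copied=$(git rev-list --count main..feature)
git log --oneline main..feature
```

After this, feature is J'-K' on top of L, and the change made by I is no longer part of the branch.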


[3] This is true in modern Git. In older versions, some rebases used cherry-pick and some didn't, and in positively ancient Git, none did. Clearly cherry-picking works better than the other methods, and that's the idea to keep in mind here.

[4] The argument to -m itself is the parent number to use for this pretending action, and most often it's just the constant 1, for reasons we won't cover as we don't have space for it here.

[5] There's a variant of git rebase that makes it run git switch first, to switch to some other branch. What you need to know is that this git switch is done as if you wrote it as a separate command. Then, after this git switch works (assuming it does work), git rebase works using the current branch. When the whole thing is done, you wind up on the branch you switched-to, just as if you had run git switch yourself.

Because of this "pretend you ran a separate command first" aspect, I recommend avoiding this, at least until you're really familiar with everything else. It's somewhat similar to git pull: if you don't realize that this is the same as running separate commands, you might be surprised to find that git switch br2 && git rebase --onto x y z leaves you on branch z, not on br2. Your git switch br2 was kind of pointless since this was equivalent to git switch br2 && git switch z && git rebase --onto x y.

[6] Actually, like everything else in Git, it produces them in the wrong order: backwards. For cherry-picking, we need them to go in the wrong-for-Git, forwards, order. So the rebase code uses --reverse internally. It also uses --topo-order and --no-merges, though again we won't discuss these details for space reasons.


When not to use git rebase

There are some times you shouldn't use git rebase. These mostly boil down to the fact that rebase works by copying. Remember that git fetch itself also works by copying commits, and git clone works by copying commits. We have our branch names, and they (some other Git repository) have theirs, and we copy some commits and adjust our branch names to find the new copies.

We can't force someone else to adjust their branch names to point to new copied commits instead of the originals. We can ask them (politely) with git push, and we can command them with git push --force, but they can reject even the forceful git push. And, even if we get one other Git repository to switch its branch names to point to new-and-improved commits, what if there are other clones that still have the old-and-lousy commits, and are storing the hash IDs in branch names?

So we generally should not use rebase to improve commits if someone else is using the old ones. We can do it, and if we have pre-approval from all "someone else"-s involved, that's all fine too. But we're guaranteed to be safe if we rebase commits that we have never let anyone else see, because in that case, there's nobody else using the old ones.

In general, then, it's "safe" to rebase our own branch names and commits if we haven't used git push to send those commits to someone else. It's still safe, even if we have used git push this way, if we're sure nobody else depends on the old commit hash IDs. And it's even safe if someone does depend on those if we make arrangements with that someone-else in advance.
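One rough way to see which commits have not been sent anywhere yet is to compare a branch with its remote-tracking name. This is only a heuristic sketch: it tells us what our Git remembers about origin, not who else may have fetched the commits. The bare repository path, identity, and commit helper are assumptions for the illustration.

```shell
set -e
work=$(mktemp -d) && cd "$work"
# Placeholder identity for the scratch repository below.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q --bare "$work/theirs.git"      # stands in for the shared repository
git init -q ours
cd ours
git symbolic-ref HEAD refs/heads/feature
git remote add origin "$work/theirs.git"

commit I
git push -q origin feature                 # I is now published; origin/feature remembers it
commit J
commit K                                   # J and K were never pushed

unpublished=$(git rev-list --count origin/feature..feature)
git log --oneline origin/feature..feature  # the commits that only we have
echo "$unpublished commit(s) not yet on origin/feature"
```

Commits listed by origin/feature..feature are the ones that, as far as our Git knows, nobody else has seen, so copying them with rebase disturbs no one.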

All the rest is mere (mere?) management: keeping track of who has which commits. It's sad that Git doesn't do a better job of this, but Git's idea of commits being permanently immutable closes a lot of possibilities. (I think Mercurial's "evolve" extension made use of a few mutable bits stuck into their commits: they defined a commit format where the hash ID doesn't cover every single bit of the commit. There are some other distributed database techniques available here, but Git doesn't use those either.)

Upvotes: 10

Ôrel

Reputation: 7622

To rebase by default on git pull

git config --global pull.rebase true

or to rebase on a specific git pull:

git pull --rebase

Using git fetch

Fetch the remote status

git fetch

Then if you want to rebase

git rebase origin/main

or if you want to merge

git merge origin/main

If you did a merge you didn't want, cancel the merge commit and the modifications it brought with:

git reset --hard HEAD~1

then rebase

git rebase origin/main
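The whole sequence can be tried end to end in scratch repositories. In this sketch, git merge origin/main stands in for the merge that a default git pull would have made, and the directory names, placeholder identity, and commit helper are assumptions for the illustration. Note that git reset --hard also discards uncommitted changes, so it should only be run with a clean working tree.

```shell
set -e
work=$(mktemp -d) && cd "$work"
# Placeholder identity for every scratch repository below.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q theirs                   # stands in for the remote repository
cd theirs
git symbolic-ref HEAD refs/heads/main
commit H
cd ..

git clone -q theirs ours
cd ours
git checkout -q -b feature
commit I
commit J

cd ../theirs
commit L                             # main moves on in the remote repository
cd ../ours

git fetch -q origin
git merge -q --no-edit origin/main   # the unwanted merge commit
git reset -q --hard HEAD~1           # cancel it: feature is back at J
git rebase -q origin/main            # copy I and J after L instead

merges=$(git rev-list --merges --count origin/main..feature)
git log --oneline --graph feature
```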

Upvotes: 3
