Reputation: 1002
In several instances, when I run git pull on a feature branch, I end up with multiple annoying "merge commits". I understand why they happen, but I want to consolidate them so that they look like a normal commit.
I tried to use git rebase -i --rebase-merges HEAD~4
but could not figure out how to squash the merge commits.
After further research and a lot of digging, I was able to do the following to consolidate unwanted merge commits into a normal commit using rebase, and then squash them if needed:
git checkout feature
git pull # will create merge commits
git branch feature_backup # to create a backup
git switch --orphan empty_commit
git commit -m "First empty commit for the feature branch" --allow-empty
git switch feature
git rebase empty_commit
git rebase -i --root # this allows you to squash commits
git branch -D empty_commit
Is there a better way to consolidate the merge commits?
Upvotes: 1
Views: 5836
Reputation: 488203
Ôrel's answer has a recipe. You might not want origin/master
directly—what you do want is up to you and you may wish to think about this and experiment, once you've read through and are working out the bits of this explanation—but rebase does work.
I still don't understand how the git rebase you mentioned will work? My problem is that when I do a git pull on a feature branch, it'll end up having merge commit so how the
git rebase origin/master
will work?
The key here is to understand a number of things simultaneously. (This is often the case with Git.) You need to know:
how git fetch works;
how git merge works;
that git pull means run git fetch, then run git merge or some other second command (you have been using git merge); and
how git rebase works.
This is a lot of stuff! We cannot hope to cover it all, not even in one of my famously1 long answers, so we'll race through a few key items.
1Or other adverb of your choice.
A commit:
is numbered: it has a unique hash ID. That number means that commit, not only in your repository, but in every repository, even all the Git repositories that don't have your commit. (This is the main deep magic that powers Git.)
is read-only: no part of any commit can ever be changed. This is required by the magic numbering system.
contains two things: a full snapshot of all files (stored indirectly, in a special Git-ized form where they're compressed and de-duplicated, so it is a good thing when a new commit re-uses almost all the files from a previous commit: that means they take no space), plus metadata: information about the commit.
The metadata contains stuff like your name and email address and the date-and-time at which you made the commit. For Git's own purposes, Git stores, in the metadata in any one commit, a list of previous commit hash IDs. Most commits—"ordinary" commits, we tend to call them—have exactly one hash ID here. This forms the ordinary commits into a simple chain, except that the arrows connecting commits are backwards instead of forwards. Humans like to think of the arrows as going forwards, but that can't work in Git, because commits are strictly read-only.
What this means is that given a string of commits in a row, each with its own hash ID, we can draw that string of commits, newer commits towards the right, like this:
... <-F <-G <-H
Here H
stands for the hash ID of the last commit in the chain. Commit H
contains a full snapshot of every file, plus some metadata. In H
's metadata, Git has stored the hash ID of some earlier commit G
, which we (and Git) call the parent of commit H
. So commit H
points to earlier commit G
.
Of course, G
is also a commit, so it has a snapshot and metadata, and that metadata holds the hash ID of some still-earlier parent F
. But F
is a commit too, so F
points backwards as well. This goes on forever, or rather, until we get to the very first commit ever, which—being the first—has no parent (a weird sort of "virgin birth" as it were; Git calls it a root commit ).
The commits are all stored in a big database (of "all Git objects", including commit objects—the other kinds of objects are mostly supporting things for commits, namely tree and blob objects, plus annotated tag objects). This database is a simple key-value store where the keys are the hash IDs. So Git needs a hash ID to find a commit.
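As a rough illustration of the parent links and the hash-ID lookup described so far, the sketch below builds a throwaway repository with two empty commits and inspects them. The temporary directory, the "Example" identity, and the commit messages G and H are assumptions made only for this demonstration.

```shell
set -e
# Throwaway identity and repository; every name here is illustrative only.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main   # make sure the first branch is called main

git commit -q --allow-empty -m "G"      # root commit: no parent
g_hash=$(git rev-parse HEAD)
git commit -q --allow-empty -m "H"      # ordinary commit: one parent
h_hash=$(git rev-parse HEAD)

# Look the commit up by its hash ID and print its raw contents:
# a tree (the snapshot), a parent line, author/committer metadata, the message.
git cat-file -p "$h_hash"

# Extract the stored parent hash ID from H's metadata.
parent_of_h=$(git cat-file -p "$h_hash" | sed -n 's/^parent //p')

# A root commit is a commit with no parent line at all.
root=$(git rev-list --max-parents=0 HEAD)
g_parent_lines=$(git cat-file -p "$g_hash" | grep -c '^parent ' || true)

echo "parent of H: $parent_of_h"
echo "root commit: $root"
```

Running it should show that H's only parent line holds G's hash ID, and that G is the root commit.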
All of this has two important implications:
Commits don't store diffs. We get diffs—we see a commit as a change—by having Git compare adjacent commits. We pick some parent/child pair, such as G
and H
, and have Git compare the two snapshots. The de-duplication trick that Git uses makes it easy for Git to throw out all exactly-the-same files right away, so Git only has to figure out what changed in two files that are actually different. This usually does not take too long, so git show
or git log -p
can show a "patch" even though Git has stored only snapshots.
Git can find every commit on its own except the last one, using the parents stored in each commit. We have to tell Git the hash ID of commit H
. From there, Git works backwards all on its own. But we have to provide a raw hash ID here, and that's horrible for humans: the hash IDs look random and there's no way to figure one out. You'd have to memorize them, or write them down, or something.
To handle that last problem, Git provides, as a separate key-value database, one that's keyed by names: branch names, tag names, and many other kinds of names. In this database, the value associated with any given name is one hash ID. You get just one hash ID—not two or three or many, just one—but that's all we need, because we only need to store the hash ID of the latest commit, e.g., commit H
.
Since we say that commit H
"points to" its parent G
, we likewise say that a branch name points to the last commit in the branch. Git's term for this is that H
is the tip commit, and we can add it to our drawing like this:
...--G--H <-- main # or master or whatever
When we're "on" some branch, we attach a special name—HEAD
—to that branch name. (Git literally just stores the branch's name in a file in .git
named HEAD
, at least for the main working tree, though you're not supposed to depend on this in case a future version of Git comes up with a better / fancier way to do it.) This means that if we have multiple branch names, all pointing to commit H
—which is a perfectly normal thing to do in Git—then as we add commits, only the HEAD
branch-name gets updated. We might start with, e.g.:
...--G--H <-- feature (HEAD), main, zorg
and then make one new commit—it gets some new, unique hash ID, but we'll just call it "commit I
"—and that git commit
command makes Git store the new hash ID in the current branch name, so that we get:
...--G--H <-- main, zorg
\
I <-- feature (HEAD)
The special name HEAD
remains attached to the current branch name. The name feature
now selects I
as its tip commit; I
points backwards to H
, because H
was the tip of feature
when we made I
; and H
and G
and so on are all unchanged (they must be, because they are all read-only). The next commit updates the current branch name yet again:
...--G--H <-- main, zorg
\
I--J <-- feature (HEAD)
New commit J
points back to I
, which points back to H
, and so on. I
cannot change once we've made it: no part of any commit can ever change. Note that, while some people will refer to commits I-J
as "what's on feature
", in fact, all the commits are on feature
: it's just that commits up through and including H
are also on other branches, while I-J
are only on feature
at the moment.
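The way only the current branch name moves can be checked in a scratch repository. This is a minimal sketch under assumed names (the branch names follow the drawings above; the temporary directory and identity are made up), using git checkout -b where git switch -c would do the same job.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

git commit -q --allow-empty -m "G"
git commit -q --allow-empty -m "H"
h_hash=$(git rev-parse HEAD)

# Three names, one commit: main, zorg, and feature all select H.
git branch zorg
git checkout -q -b feature              # attach HEAD to the new name

# New commits update only the branch name that HEAD is attached to.
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"

current=$(git symbolic-ref --short HEAD)
main_tip=$(git rev-parse main)
zorg_tip=$(git rev-parse zorg)
only_on_feature=$(git rev-list --count main..feature)
branches_with_h=$(git branch --contains "$h_hash" | wc -l | tr -d ' ')

git log --oneline --graph --decorate --all
echo "HEAD is attached to: $current"
echo "commits only on feature: $only_on_feature"
echo "branches containing H: $branches_with_h"
```

Commit H stays reachable from all three names, while I and J are reachable only from feature.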
git fetch, or, commits are universal but branch names are not
When we clone a repository, we copy all of its commits2 and none of its branch names. Instead of copying its branch names, we take each of their (the other repository's) branch names and change it into a remote-tracking name: their main
becomes our origin/main
, and their feature
becomes our origin/feature
, for instance. Then our Git will create one branch name in our new repository, that contains all of their commits and these modified names. The new branch name will match one of their branch names and will select the same tip commit as their name, so our picture might look like this:
...--G--H <-- main (HEAD), origin/main
\
I--J <-- origin/feature
depending on what they had in their Git repository when we ran git clone
to make our Git repository.
We can now create our own feature
name as well, and switch to it:
...--G--H <-- main, origin/main
\
I--J <-- feature (HEAD), origin/feature
and if and when we make new commits, they add on to our feature
. Our memory of their feature
still points to commit J
though:
...--G--H <-- main, origin/main
\
I--J <-- origin/feature
\
K--L <-- feature (HEAD)
These look exactly like branches, because they are exactly like branches, depending on what we mean by branch (see also What exactly do we mean by "branch"?). To the extent that "branch" means "set of commits found by starting from some name and working backwards", a remote-tracking name works just fine as a branch. (But it's also not a branch because you cannot git switch
to it. So is it a branch? That depends on what you mean by branch. This problem with the word branch is why you must be careful when saying "branch"—you may know what you mean, but will someone else? Are you even sure you know what you mean? 😀 Human cognition being as loose as it is, sloppy wording might result in bad conclusions.)
In any case, your remote-tracking names are reflections of their branch names, as of the last time your Git software called up their Git repository and asked them about those branch names. But if it's been a few months, or days, or hours, or perhaps even seconds, maybe their repository has changed. To resync, you run git fetch, which calls up their Git, obtains any commits they have that you lack, and updates your remote-tracking names.
So, after git fetch
, if they provided new commits, you might have:
...--G--H <-- main, origin/main
\
I--J----N--O <-- origin/feature
\
K--L <-- feature (HEAD)
Your feature
is now behind their feature
, as remembered by your origin/feature
, in that commits N
and O
are not on your feature
. Your feature
is also ahead of their feature
by two commits: your K-L
. But the "behind-ness" can be an issue.
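The ahead/behind situation drawn above can be reproduced with two scratch repositories. In this sketch the directory names theirs and ours are hypothetical, and the commits are empty ones labelled to match the drawing; it only shows that origin/feature is a stale memory until git fetch runs.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# "Their" repository: G-H on main, I-J on feature.
git init -q "$work/theirs"
cd "$work/theirs"
git symbolic-ref HEAD refs/heads/main
git commit -q --allow-empty -m "G"
git commit -q --allow-empty -m "H"
git checkout -q -b feature
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"
git checkout -q main

# Our clone: their branch names become our origin/* remote-tracking names.
git clone -q "$work/theirs" "$work/ours"
cd "$work/ours"
git checkout -q -b feature origin/feature
git commit -q --allow-empty -m "K"
git commit -q --allow-empty -m "L"

# Meanwhile, they add N and O to their feature.
cd "$work/theirs"
git checkout -q feature
git commit -q --allow-empty -m "N"
git commit -q --allow-empty -m "O"

# Our origin/feature still remembers J until we fetch.
cd "$work/ours"
stale=$(git rev-parse origin/feature)
git fetch -q
fresh=$(git rev-parse origin/feature)

# First number: commits only on feature (ahead).
# Second number: commits only on origin/feature (behind).
set -- $(git rev-list --left-right --count feature...origin/feature)
ahead=$1
behind=$2
echo "feature is ahead by $ahead and behind by $behind"
```

Note that git fetch changes only origin/feature here; our own feature name still selects L.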
2Well, all or mostly-all: we won't get into fine distinctions here. Just note that this is a bit of an over-generalization.
How git merge works
Let's drop back a bit and just think about your own individual repository. Suppose you have two branch names, br1
and br2
, that are arranged like this:
I--J <-- br1
/
...--G--H
\
K--L <-- br2
You'd like to use git merge
to tie these two branches together. You pick either one and switch to it, e.g., git switch br1
or git checkout br1
:
I--J <-- br1 (HEAD)
/
...--G--H
\
K--L <-- br2
and then you run git merge br2
. The end result of this is:
I--J
/ \
...--G--H M <-- br1 (HEAD)
\ /
K--L <-- br2
after which you're free to delete the name br2
because its tip commit L
is now find-able from commit M
. You don't have to delete the name, but you can: the point of branch names is to be able to find some tip commit, and that's only interesting if you plan to use that commit again, or add on to it, or both.
To make commit M
, Git has to combine work. This combining-work trick requires using git diff
twice, and making that work requires finding some suitable commit that's on both branches. In this particular case, it's easy to see that commit H
is on both br1
and br2
and is the most suitable commit: Git calls this the merge base of the two branch tip commits. So Git now runs:
git diff --find-renames <hash-of-H> <hash-of-J> # what we changed on br1
git diff --find-renames <hash-of-H> <hash-of-L> # what they changed on br2
The merge code then combines the two sets of changes, applies the combined changes to commit H
—this adds their changes to our changes, or adds our changes to their changes, depending on how you want to look at it—and if all goes well, git merge
goes on to make merge commit M
.
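The merge-base lookup and the two diffs can be tried by hand. The following is a sketch in a scratch repository; commit_file is a hypothetical helper defined only for this example, and the branch names follow the drawing above.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

# Hypothetical helper: add a file named after the commit, then commit it.
commit_file() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit_file H
h_hash=$(git rev-parse HEAD)
git checkout -q -b br1
commit_file I
commit_file J
j_hash=$(git rev-parse HEAD)
git checkout -q -b br2 main
commit_file K
commit_file L
l_hash=$(git rev-parse HEAD)

git checkout -q br1
base=$(git merge-base br1 br2)               # the shared commit H
git diff --find-renames --stat "$base" br1   # what we changed on br1
git diff --find-renames --stat "$base" br2   # what they changed on br2
git merge -q --no-edit br2                   # combine both; makes commit M

# Inspect the new commit's parent list.
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
first_parent=$(git rev-parse HEAD^1)
second_parent=$(git rev-parse HEAD^2)
echo "merge base: $base"
echo "M has $parent_count parents"
```

The first parent of the new commit is the old tip of br1 and the second is the tip of br2.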
What's special about M
is that it has two parents. That's the only thing special about M
! It is like any other commit in every other way: it has a snapshot of all files, and it has metadata. The snapshot contains the files from the merge base as modified by the combined changes. The metadata contains your name and email address as usual, and the current date-and-time as usual, and—this is the only special part—a list of two parent hash IDs, instead of just one.
This makes M
find both parent commits, which is why we can delete br2
. But we do not have to delete br2
. We can go on and make more commits on br2
if we like:
I--J
/ \
...--G--H M <-- br1
\ /
K--L---N--O <-- br2 (HEAD)
If we now switch back to br1
and run git merge br2
again, we'll get another merge commit:
I--J
/ \
...--G--H M------P <-- br1 (HEAD)
\ / /
K--L---N--O <-- br2
The merge base for the two diffs will be commit L
this time (this is less obvious) and the two tip commits will be M
and O
and that's the source for the files that go into new merge commit P
. The end result is that Git doesn't have to consider the changes in I-J
or K
again separately, nor start from H
: it gets to start from L
instead. More repeated work and merges result in yet more of the same:
...--M-----P-----T <-- br1 (HEAD)
/ / /
...--L--N--O--R--S <-- br2
The merge base for making T
was O
, with the two tip commits being P
and S
.
This is git merge
in a nutshell: we combine changes since some common starting point, found by inspecting the connections from commit-to-commit, and make a new merge commit with two parents, so that if there is a next merge, that next merge can start where we left off.
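To see the "start where we left off" effect, one can ask Git for the merge base before each repeated merge. This sketch repeats the br1/br2 setup in a scratch repository; commit_file is again a hypothetical helper, not part of Git.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

# Hypothetical helper: add a file named after the commit, then commit it.
commit_file() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit_file H
git checkout -q -b br1
commit_file I
commit_file J
git checkout -q -b br2 main
commit_file K
commit_file L
l_hash=$(git rev-parse HEAD)

git checkout -q br1
git merge -q --no-edit br2                   # first merge: commit M

git checkout -q br2
commit_file N
commit_file O
o_hash=$(git rev-parse HEAD)

git checkout -q br1
second_base=$(git merge-base br1 br2)        # now L, no longer H
git merge -q --no-edit br2                   # second merge: commit P
third_base=$(git merge-base br1 br2)         # O, where a later merge would start

echo "base for the second merge: $(git log -1 --format=%s "$second_base")"
echo "base after the second merge: $(git log -1 --format=%s "$third_base")"
```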
git pull
Once you properly understand commits, git fetch
, and git merge
, the default git pull
—which basically just runs git fetch
and then git merge
in that order—is simple. Of course git pull
wouldn't be what it is without a lot of extra options and features, but if you have not been using them, we can stop here.
I always recommend that Git newbies avoid git pull
because:
Fails is too strong, which is why I put it in quotes, but both git merge
and git rebase
can stop in the middle. Merge is much simpler than rebase (as we'll see in a moment), so it might well be the better default, but either way you must know how to clean up after a failure, or you're in trouble—and, unfortunately (and for bad reasons), the methods for cleaning up differ, so you need to know which one you're using. If you're new to Git, you won't even know which one you're running: your default depends on your configuration, and if you're working with a team (which is common) they may have asked you to set your default to use git rebase
.
If you run git fetch
, and then, separately, run your second command, you'll have a better idea what you're doing, and hence a better idea about how to get help. It's not really any easier—I had a colleague once who referred to this as pushing the "hard" around—and yet ... it's somehow easier. (Perhaps it's just more informative, but it does seem to help.)
git rebase
We often find, when using Git, that some set of commits we've made so far are ... well, they are okay, but not really what we want. We have a big problem here though: commits are literally impossible to change. Once made, they're set in stone.
But: what if we could copy some set of commits to some new-and-improved commits? Let's say we have:
...--G--H <-- main
\
I--J--K <-- feature (HEAD)
We're sort of done with our feature
but we've just noticed that there's a bug in I
, so we make one more commit to fix it:
...--G--H <-- main
\
I--J--K--L <-- feature (HEAD)
Now, the only reason we have L
at all is to fix the bug in I
. It sure would be nice if we hadn't put the bug into I
in the first place. And we can do that. We can't change commit I
, but we can make a new combined commit out of I
and L
—let's call it "commit IL
"—that comes after H
:
IL <-- temporary-branch (HEAD)
/
...--G--H <-- main
\
I--J--K--L <-- feature
We do this by creating a new branch name, temporary-branch
, that points to commit H
(same as main
), then making our new commit by whatever means (we'll come back to the "by whatever means" in a moment). Then we make another new commit J
, that's a duplicate of what J
did but as applied to IL
. We'll call this new commit J'
since it's so similar to J
:
IL-J' <-- temporary-branch (HEAD)
/
...--G--H <-- main
\
I--J--K--L <-- feature
Last, we copy K
to a new K'
:
IL-J'-K' <-- temporary-branch (HEAD)
/
...--G--H <-- main
\
I--J--K--L <-- feature
Our temporary branch now holds the three commits we wish we had made.
Now, hang on a moment: a while ago, we observed that we find commits by using the branch name to find the tip commit. Each branch name holds exactly one hash ID. What happens if we force the name feature
to select commit K'
instead of commit L
? We'll get this:
IL-J'-K' <-- feature, temporary-branch (HEAD)
/
...--G--H <-- main
\
I--J--K--L [abandoned]
Commits I-J-K-L
will still exist, but without a name by which to find them, we won't ever see them. We can now attach HEAD
to feature
again and delete the temporary branch name entirely:
IL-J'-K' <-- feature (HEAD)
/
...--G--H <-- main
\
I--J--K--L [abandoned]
It looks like we did everything right, right from the start.
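One way to get this I-plus-L result without editing an instruction sheet by hand is the combination of git commit --fixup and git rebase --autosquash. That is a swapped-in technique, not something this answer prescribes, and the sketch below makes further assumptions: a scratch repository, invented file names, and GIT_SEQUENCE_EDITOR=true to accept the generated instruction sheet unchanged.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main
echo base > base.txt; git add base.txt; git commit -q -m "H: base"

git checkout -q -b feature
echo buggy > i.txt; git add i.txt; git commit -q -m "I: add i.txt"
i_hash=$(git rev-parse HEAD)
echo j > j.txt; git add j.txt; git commit -q -m "J: add j.txt"
echo k > k.txt; git add k.txt; git commit -q -m "K: add k.txt"

# Commit L exists only to fix the bug in I; --fixup records that intent
# by giving L the subject "fixup! I: add i.txt".
echo fixed > i.txt; git add i.txt
git commit -q --fixup="$i_hash"
l_hash=$(git rev-parse HEAD)

# --autosquash moves L right after I and marks it as a fixup.
# GIT_SEQUENCE_EDITOR=true accepts that instruction sheet as generated.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash main

count=$(git rev-list --count main..feature)      # IL, J', K'
first_subject=$(git log -1 --format=%s HEAD~2)   # subject of IL
first_content=$(git show HEAD~2:i.txt)           # i.txt as of IL
old_l_type=$(git cat-file -t "$l_hash")          # abandoned L still exists

git log --oneline main..feature
```

The feature branch ends up with three commits, the first of which already contains the fix, while the original L is merely abandoned rather than destroyed.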
Let's consider another scenario, where we make our I-J-K
commits perfectly, but someone else has come along and added a new L
to what we'd like as our main
. We get our own main
updated:
...--G--H--L <-- main
\
I--J--K <-- feature (HEAD)
and now the one thing we don't like about I-J-K
is that they come after H
: if they just came after L
, they'd be perfect. We can use the exact same trick, of making a temporary branch, copying the three commits that we do like, and making the name feature
point to the final tip commit:
I'-J'-K' <-- feature (HEAD)
/
...--G--H--L <-- main
\
I--J--K [abandoned]
The abandoned commits vanish from view (though they still exist in the repository) and it seems like we started our work after L
existed.
In both cases, what we're doing is relatively simple, though it has a lot of consequences: we're copying—and making some change or changes as we go, before they're set into stone—some existing commits to some sort of new-and-improved commits. Git provides one major "power tool" command for doing this, namely git rebase
.
The most basic form of git rebase
works by making no changes to what each commit changes. We use this one for the simple I-J-K
-on-H
becomes I'-J'-K'
-on-L
operation, for example. Each "copy one commit" step is done with a simple git cherry-pick
command.3
But there's a problem: cherry-pick literally can't copy a merge commit. If you try it, you get an error telling you that you must supply the -m
option. This option tells git cherry-pick
to pretend that the commit is an ordinary single-parent commit, instead of a merge.4 The original git rebase
solution to this problem is to ignore merge commits entirely.
That is, if we have:
I--J--M <-- feature (HEAD)
/ /
...--G--H--K--L <-- main
and we run git rebase main
, git rebase
ignores commit M
entirely. It copies just commits I-J
, putting the copies after L
:
I--J--M [abandoned]
/ /
...--G--H--K--L <-- main
\
I'-J' <-- feature (HEAD)
It turns out that this is often what we want anyway.
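The drawing above can be reproduced in a scratch repository to confirm that a plain git rebase leaves the merge out of the copied commits. The helper commit_file and all file names are assumptions for this sketch; the explicit git merge stands in for the merge that a default git pull would have made.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

# Hypothetical helper: add a file named after the commit, then commit it.
commit_file() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit_file H
git checkout -q -b feature
commit_file I
commit_file J
git checkout -q main
commit_file K
commit_file L
l_hash=$(git rev-parse HEAD)

git checkout -q feature
git merge -q -m "M" main                     # the unwanted merge commit
merges_before=$(git rev-list --count --merges main..feature)

git rebase -q main                           # copies I and J only, after L
merges_after=$(git rev-list --count --merges main..feature)
subjects_after=$(git log --reverse --format=%s main..feature | xargs)
new_base=$(git rev-parse feature~2)

echo "merge commits before: $merges_before, after: $merges_after"
echo "commits now on feature after main: $subjects_after"
```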
When it's not what we want, then we have the new-in-Git-2.18 git rebase --rebase-merges
option. This doesn't try to copy the merges at all. Instead, it copies the commits leading up to the merge, then runs git merge
anew to make a new merge commit, then resumes copying more commits if needed. But since we want to drop merges, we get to ignore this newfangled option.
We also have git rebase --interactive
, which offers the ability to do squash and fixup operations, and to move commits around. Each pick
command in an instruction sheet from git rebase --interactive
corresponds to an individual git cherry-pick
command. Changing one to squash
means that instead of doing a normal cherry-pick-and-commit, the previous operation should do a cherry-pick but not commit yet, and then this commit should be added in, and then we'll commit.
All of these fancy options are just that: fancy options that add on to the basic idea, that we're going to copy some commits. Fundamentally, then, git rebase
needs to know two things: which commits to copy, and where to put the copies.
The git rebase
command by default cleverly combines both questions into one. We start with the observation that when we run git merge
, we're always merging into the current branch. Similarly, when we run git rebase
, we're always copying commits from the current branch.5 The problem with just saying "well, we want the commits that are on the current branch" is that there are probably hundreds or even thousands of commits on the current branch. The current branch goes all the way back to the very first commit! We don't want to copy all those commits. We want to copy a selected subset of those commits.
In the example above, for instance, the selected subset of commits we want to copy is I-J-M
, except that M
is a merge, so we don't want to copy it after all. That leaves I-J
as the commits to copy.
Git has a funky syntax, X..Y
, that means all commits reachable from Y
but not reachable from X
. This "reachable from" notion is another thing you should know about Git: see, e.g., Think Like (a) Git. In our case, main..feature
produces exactly the right list of commits: I
, J
, and then M
.6
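The two-dot selection can be inspected directly with git rev-list or git log. This sketch rebuilds the I-J-M example in a scratch repository; commit_file is a hypothetical helper, and the --reverse and --no-merges flags are passed by hand here simply to show the list a basic rebase would end up copying.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

# Hypothetical helper: add a file named after the commit, then commit it.
commit_file() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit_file H
git checkout -q -b feature
commit_file I
commit_file J
git checkout -q main
commit_file K
commit_file L
git checkout -q feature
git merge -q -m "M" main

# Everything reachable from feature but not from main.
selected=$(git rev-list --count main..feature)
in_order=$(git log --reverse --format=%s main..feature | xargs)

# The same range with the merge left out, oldest first.
to_copy=$(git log --reverse --no-merges --format=%s main..feature | xargs)

echo "main..feature selects $selected commits: $in_order"
echo "commits a plain rebase would copy: $to_copy"
```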
Since we're always going to copy commits from the current branch, git rebase
doesn't need you to write down the name feature
. It just needs you to type in the name main
, so that it can construct main..feature
itself and get the list of commits to copy. And—this is the clever part—the name main
selects the right place to put the copies too! We want I'-J'
to go after commit L
, and the name main
selects commit L
.
So we just run:
git rebase main
or perhaps git rebase -i main
to get the interactive variety, and Git knows which commits to copy (those in main..feature) and where to put the copies (after the commit that main selects), and that's all we need!
Occasionally this doesn't quite work. For these cases, git rebase
has --onto
. We run:
git rebase --onto <place> <what-not-to-copy>
and Git uses <what-not-to-copy>..HEAD to list the commits to copy. That is, the argument we give without --onto supplies the X part of X..Y, while the --onto argument says where to put the copies. That's how we run git rebase with --onto.
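A common case for --onto is a branch that was started on top of another topic branch and should instead sit directly on main. The sketch below assumes hypothetical branch names topic1 and topic2 and a hypothetical commit_file helper, all in a scratch repository.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main

# Hypothetical helper: add a file named after the commit, then commit it.
commit_file() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit_file H
git checkout -q -b topic1
commit_file A
commit_file B
git checkout -q -b topic2                    # starts from B, on top of topic1
commit_file C
commit_file D

# <place> is main; <what-not-to-copy> is topic1.
# Git copies topic1..HEAD, that is C and D, and puts the copies after main.
git rebase -q --onto main topic1

copied=$(git log --reverse --format=%s main..topic2 | xargs)
if git merge-base --is-ancestor topic1 topic2; then
  still_on_topic1=yes
else
  still_on_topic1=no
fi
echo "commits on topic2 after main: $copied"
echo "topic2 still built on topic1: $still_on_topic1"
```

After the rebase, topic2 no longer contains A or B, so A.txt is absent from its working tree.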
3This is true in modern Git. In older versions, some rebases used cherry-pick and some didn't, and in positively ancient Git, all rebases didn't. Clearly cherry-picking works better than the other methods, and that's the idea to keep in mind here.
4The argument to -m
itself is the parent number to use for this pretending action, and most often it's just the constant 1
, for reasons we won't cover as we don't have space for it here.
5There's a variant of git rebase
that makes it run git switch
first, to switch to some other branch. What you need to know is that this git switch
is done as if you wrote it as a separate command. Then, after this git switch
works (assuming it does work), git rebase
works using the current branch. When the whole thing is done, you wind up on the branch you switched-to, just as if you had run git switch
yourself.
Because of this "pretend you ran a separate command first" aspect, I recommend avoiding this, at least until you're really familiar with everything else. It's somewhat similar to git pull
: if you don't realize that this is the same as running separate commands, you might be surprised to find that git switch br2 && git rebase --onto x y z
leaves you on branch z
, not on br2
. Your git switch br2
was kind of pointless since this was equivalent to git switch br2 && git switch z && git rebase --onto x y
.
6Actually, like everything else in Git, it produces them in the wrong order: backwards. For cherry-picking, we need them to go in the wrong-for-Git, forwards, order. So the rebase code uses --reverse
internally. It also uses --topo-order
and --no-merges
, though again we won't discuss these details for space reasons.
git rebase
There are some times you shouldn't use git rebase
. These mostly boil down to the fact that rebase works by copying. Remember that git fetch
itself also works by copying commits, and git clone
works by copying commits. We have our branch names, and they (some other Git repository) have theirs, and we copy some commits and adjust our branch names to find the new copies.
We can't force someone else to adjust their branch names to point to new copied commits instead of the originals. We can ask them (politely) with git push
, and we can command them with git push --force
, but they can reject even the forceful git push
. And, even if we get one other Git repository to switch its branch names to point to new-and-improved commits, what if there are other clones that still have the old-and-lousy commits, and are storing the hash IDs in branch names?
So we generally should not use rebase to improve commits if someone else is using the old ones. We can do it, and if we have pre-approval from all "someone else"-s involved, that's all fine too. But we're guaranteed to be safe if we rebase commits that we have never let anyone else see, because in that case, there's nobody else using the old ones.
In general, then, it's "safe" to rebase our own branch names and commits if we haven't used git push
to send those commits to someone else. It's still safe, even if we have used git push
this way, if we're sure nobody else depends on the old commit hash IDs. And it's even safe if someone does depend on those if we make arrangements with that someone-else in advance.
All the rest is mere (mere?) management: keeping track of who has which commits. It's sad that Git doesn't do a better job of this, but Git's idea of commits being permanently immutable closes a lot of possibilities. (I think Mercurial's "evolve" extension made use of a few mutable bits stuck into their commits: they defined a commit format where the hash ID doesn't cover every single bit of the commit. There are some other distributed database techniques available here, but Git doesn't use those either.)
Upvotes: 10
Reputation: 7622
To rebase by default on git pull
git config --global pull.rebase true
or to rebase on one specific git pull
git pull --rebase
Using git fetch
Fetch the remote status
git fetch
Then if you want to rebase
git rebase origin/main
or if you want to merge
git merge origin/main
If you made a merge you didn't want, cancel the merge commit and the changes it brought in with:
git reset --hard HEAD~1
then rebase
git rebase origin/main
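The undo-then-rebase sequence can be rehearsed safely with two scratch repositories. This sketch makes several assumptions: the directories theirs and ours and the helper commit_file are hypothetical, origin/feature stands in for whichever remote-tracking name applies in a real repository, and git pull --no-rebase is spelled out so that the merge happens regardless of local configuration.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# Hypothetical helper: add a file named after the commit, then commit it.
commit_file() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

git init -q "$work/theirs"
cd "$work/theirs"
git symbolic-ref HEAD refs/heads/main
commit_file H
git checkout -q -b feature
commit_file I
commit_file J
git checkout -q main

git clone -q "$work/theirs" "$work/ours"
cd "$work/ours"
git checkout -q -b feature origin/feature
commit_file K
commit_file L

cd "$work/theirs"
git checkout -q feature
commit_file N
commit_file O

cd "$work/ours"
git pull -q --no-rebase --no-edit            # fetch + merge: makes a merge commit
merges_after_pull=$(git rev-list --count --merges origin/feature..feature)

git reset -q --hard HEAD~1                   # drop the merge; back to our L
git rebase -q origin/feature                 # copy K and L after their O
merges_after_rebase=$(git rev-list --count --merges origin/feature..feature)

set -- $(git rev-list --left-right --count feature...origin/feature)
ahead=$1
behind=$2
echo "merges after pull: $merges_after_pull, after rebase: $merges_after_rebase"
echo "ahead by $ahead, behind by $behind"
```

Be aware that git reset --hard also discards uncommitted changes, so it is worth checking git status before trying this on real work.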
Upvotes: 3