Reputation: 3328
My PR to a branch on GitHub shows the following commits:
- "commit msg 1"
- "commit msg 2"
- "Merge remote-tracking branch 'upstream/dev' into this branch."
- "commit msg 3"
- "commit msg 4"
- "Merge remote-tracking branch 'upstream/dev' into this branch."
I want to rebase this branch and squash the four commits with messages "commit msg *" into a single commit.
First I tried:
git rebase -i <commit id of the first commit>
It showed me a history containing many other commits, which were introduced by merging upstream/dev, with output like the following:
- pick "commit msg1"
- pick "someone else's commit 1"
- pick "someone else's commit 2"
- pick "someone else's commit 3"
- pick "commit msg2"
- pick "someone else's commit 4"
- pick "someone else's commit 5"
...
I tried changing pick to f for all the commits, and after resolving merge conflicts, it showed all the changes made in upstream/dev in my branch, as if I were re-implementing them.
I tried: - https://stackoverflow.com/a/5201642 - https://stackoverflow.com/a/5190323
I know I can try merge --squash (e.g., https://stackoverflow.com/a/5309051/947889), but that creates a separate branch.
The example commits here are simplified for clarity. The actual branch contains ~250 commits, and rebase shows ~300,000 commits, which makes sense since it is a big feature implemented on an active repository over the course of 2+ years.
Any suggestions on how best I can rebase this branch to a single commit?
Upvotes: 1
Views: 196
Reputation: 487755
You almost certainly do want git merge --squash, but done with a detached HEAD followed by a branch-movement operation, e.g.:
$ git checkout upstream/master
$ git merge --squash yourbranch
$ git commit # merge --squash only stages the result; this makes the commit
$ git checkout -B yourbranch # or git branch -f yourbranch HEAD
(but see the long answer below).
I know I can try merge --squash (e.g., https://stackoverflow.com/a/5309051/947889), but that creates a separate branch.
Using git merge --squash does not create a separate branch (it just prepares a single new commit, which you then make with git commit). But it would not matter if it did, because Git's branches are basically meaningless: you can change or rearrange your branch names any time and way you like. What matters in Git is not branches (or more precisely, branch names) but rather commits. A Git repository is a collection of commits, plus some auxiliary information. The auxiliary information can be changed. The commits cannot. The branch names are part of this changeable auxiliary information.
Each commit has its own unique big ugly hash ID. These hash IDs are the real names of the commits. Every commit is completely, totally read-only. You cannot change any existing commit. The commit's hash ID means that commit, and no other commit. But the thing about these hash IDs is that they appear to be completely random. How will you ever find the right hash IDs?
Well, for one thing, each commit stores the hash ID of some set of other, earlier commits. These are the parent commits of this one particular commit. Most commits store exactly one parent hash ID.
When one commit stores the hash ID of some other, earlier commit, we say that the later commit points to the earlier commit. (Note that no commit can store the hash ID of a later commit, as the hash ID of that later commit does not exist yet at the time the earlier commit is being created, and once created, no commit can ever be changed.) So when you have a long line of commits that were created one after another, each one points backwards to the previous commit. If we draw this out—using uppercase letters to stand in for real commit hash IDs—we get a picture that looks like this:
... <-F <-G <-H
Here H is the latest commit, with some hash ID H. The actual commit itself, which is now frozen for all time, contains the raw hash ID of earlier commit G. Commit G contains the raw hash ID of earlier commit F, which in turn contains another earlier commit hash ID, and so on.
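If you want to see these backwards-pointing links for yourself, here is a minimal sketch. It assumes a throwaway repository in a temporary directory; the file name file.txt, the commit messages, and the identity settings are made up for the illustration.

```shell
set -e
# Throwaway repository; nothing here touches your real work.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Three commits in a row, standing in for F, G and H.
for name in F G H; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "$name"
done

h=$(git rev-parse HEAD)      # hash ID of H
g=$(git rev-parse HEAD~1)    # hash ID of G
f=$(git rev-parse HEAD~2)    # hash ID of F

# The raw commit object for H records G's hash ID on its "parent" line.
git cat-file -p "$h"
parent_of_h=$(git cat-file -p "$h" | sed -n 's/^parent //p')
```

The parent line printed for H holds the same hash ID that git rev-parse HEAD~1 reports, and the very first commit (F) has no parent line at all.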
This is where branch names come in. A branch name simply contains the hash ID of the last commit we want to say is "on the branch". So if H is to be the last commit on some branch, we just put its hash ID into the branch name. This name now points to commit H, just like H points to G:
...--F--G--H <-- branch1
We can now create another branch name, such as base, and make it point to existing commit F (using git branch base <hash-of-F>):
...--F   <-- base
      \
       G--H   <-- branch1
We had to draw the commit graph (the F-G-H lines) a bit differently to squeeze in the name, but the commits are completely unchanged by this action. All we did was make a new name that points to F, so that F is the last commit on branch base. The name branch1 still identifies H, so H is the last commit on branch branch1.
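A small sketch of this, again in a throwaway repository (the file name and identity settings are assumptions for the demo): creating the name base adds a pointer and leaves the set of commits exactly as it was.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
git symbolic-ref HEAD refs/heads/branch1   # name the initial branch branch1

for name in F G H; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "$name"
done

f=$(git rev-parse HEAD~2)
h=$(git rev-parse HEAD)

commits_before=$(git rev-list --all | sort)
git branch base "$f"                       # new name pointing at existing commit F
commits_after=$(git rev-list --all | sort)

git rev-parse base branch1                 # base -> F, branch1 -> H
```

Both listings of commits come out identical; only the set of names changed.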
Let's remove base (git branch -d base), and make a new name feature that points to H too. We'll make sure to git checkout feature as well, so that HEAD is attached to the name feature:
...--F--G--H <-- branch1, feature (HEAD)
(We can do this with git checkout -b feature branch1, for instance.) Now we'll make a new commit, in the usual way. This new commit gets a new, unique hash ID, but let's just call it I. What Git does now is move the name feature so that it points to new commit I. The parent of new commit I is H, so that I points back to H:
...--F--G--H   <-- branch1
            \
             I   <-- feature (HEAD)
This is what branches are: they are just names, or labels, that point to specific commits, with the special trick that when you git checkout one of them, you not only get that commit ready to work with, you also arrange for the next git commit operation to update the name.
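Here is a sketch of that "special trick" in a scratch repository (file name and identity settings are invented for the demo): after git checkout -b feature branch1, a new commit moves feature and leaves branch1 where it was.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
git symbolic-ref HEAD refs/heads/branch1

for name in F G H; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "$name"
done
h=$(git rev-parse branch1)

git checkout -q -b feature branch1   # HEAD is now attached to feature
echo I > file.txt
git commit -q -a -m I                # new commit I; only feature moves
i=$(git rev-parse feature)

git log --oneline --decorate --all
```

The log output shows feature (with HEAD attached) on the new commit and branch1 still on H.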
Nearly everything you do in Git is about creating, or obtaining, commits, and then making various names point to particular ones of these newly created or obtained commits. A branch name just lets you find some particular commit. By definition, the name is the last commit on that branch. It doesn't matter how you move the branch name around. Whenever that name exists, it points to some commit. That commit is the last commit on the branch. Move the name and you've changed which commit is the last commit on the branch. You haven't changed any of the commits—they are all still there—you have merely changed which commit is the last one for that branch.
What git rebase does is to copy some set of commits, then move the branch name. Consider, for instance, this graph:
...--F--G--H   <-- master
         \
          I--J   <-- feature
You start by doing a git checkout of the name you'd like to move-after-copying:
$ git checkout feature
You then run a git rebase command. It takes some arguments, which the documentation calls the --onto and upstream parameters. These specify a target commit, which is where the copies should go, and also which commits should get copied:
$ git rebase master
You can give just one argument, as here (git rebase master), in which case both the target commit and the set of commits are found using that one name. Here, the target commit is commit H and the set of commits to copy are commits I and J.
The rebase command now copies each commit, as if by using git cherry-pick. The copies get new hash IDs. There are a lot of fiddly corner cases here, for which you can use options to git rebase, but in this case it is simple and we will end up with a copy of I that has a new and different hash ID, which we'll call I', and a copy of J that we'll call J'. There are two big differences between I and I', and the one we can see in the drawing here is that the parent of I' is not G but H. The same goes for commit-copy J', which sits on top of I' rather than I:
             I'-J'   <-- HEAD (detached)
            /
...--F--G--H   <-- master
         \
          I--J   <-- feature
(The difference you can't see in this drawing is that the snapshot saved with commit I' is probably different from that saved with I, because cherry-pick effectively takes the change from G to I and applies that change to the snapshot in H, rather than to G.)
Having copied these two commits, rebase finishes by moving the branch name:
             I'-J'   <-- feature (HEAD)
            /
...--F--G--H   <-- master
         \
          I--J   [abandoned]
What happens to commits I and J? The answer is a little complicated, but all we need for now is: nothing, yet. Git keeps them around for a while, in case you decide the rebase was a bad idea. But they have become hard to find. New commit J' is easy to find: the name feature finds it. Commit I' is easy to find: we just go to J' and then follow its backwards-pointing arrow to I'. But what's the hash ID of commit J? It was in the name feature, but it's not anymore. If you can find J, you can use it to find I, but unless you have saved the hash ID of J somewhere, this may be a bit tricky.[1] Eventually (typically some time after 30 days from now) Git will reclaim them entirely as unneeded, if you haven't used some other name, such as another branch or tag name, to make sure they stick around.
[1] For right now, it's really easy to find: Git saves it under the name ORIG_HEAD. But other commands will replace the hash ID stored in ORIG_HEAD. There is a second way to find it, though, using git reflog, and that's what keeps the commit around for at least another month or so by default.
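To watch the copy-then-move behavior happen, here is a sketch that builds the small F-G-H / I-J graph above in a scratch repository and rebases feature onto master. The helper function commit, the one-file-per-commit layout, and the identity settings are assumptions made for the demo.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: each commit adds its own file, so nothing conflicts.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F; commit G
git checkout -q -b feature      # feature forks from G
commit I; commit J
git checkout -q master
commit H                        # master moves on to H

old_j=$(git rev-parse feature)  # original J, before the rebase
git checkout -q feature
git rebase -q master            # copy I and J onto H, then move feature
new_j=$(git rev-parse feature)  # this is J'

git log --oneline --graph --all
git reflog -n 5 feature         # the old tip is still recorded here
```

After the rebase, feature names J', the parent of I' is H, and the original J is still retrievable through ORIG_HEAD (until some other command overwrites it) or the reflog.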
A merge commit has two (or more) parents. When we have Git follow the backwards-pointing links from commits to their parents, it generally follows all the links. So from a merge commit, Git goes down both paths.
Your real commit graph is not simple at all, but your question-example commit graph could be not too bad. It might look like this:
                 Y--Z   <-- upstream/master
                /
...--o--o--o--W--o--o--o--X   <-- upstream/dev
         \        \        \
          A--B-----M--C--D--N   <-- yourbranch (HEAD)
Here, the six commits that are "on" (reachable from) yourbranch that are not reachable from upstream/dev (i.e., from commit X) are A-B-M-C-D-N. Let's look at this more closely:
If we start at commit X and work backwards, the commits we'll find are all the unlettered ones between W and X, plus W, plus all the unlettered ones before W.
If we start at yourbranch (commit N) and work backwards, we visit commit X (via one parent link from N) and commit D (via the other parent link from N). From D we get to C, then to M, then to both B and some unnamed commit. We get to that unnamed commit from X too.
If we start at upstream/master, or commit Z, we'll visit Z, then Y, then W, and then all those unnamed commits before W.
So if we run git rebase like this:
git checkout yourbranch
git rebase upstream/master
your Git will list out all the commits reachable from N that are not reachable from Z. You used upstream/master as both your target (--onto) and your upstream ("don't copy commits reachable from Z"). That means Git won't copy commits W and earlier (they're reachable from Z) but will copy the o-o-o-X commits as well as A-B-C-D. Rebase normally discards all merge commits, so it will toss out both M and N, but you're left with eight commits being copied instead of four.
One thing you could do about this is run:
git rebase --onto upstream/master upstream/dev
This separates the what not to copy argument from the where to put the copies argument. We still tell rebase: put the copies after commit Z, but this time, we tell rebase: don't copy commits reachable from X. So Git lists out commits A-B-M-C-D-N as the commits to copy, then tosses M and N because they're merges, and is left with the job of copying A-B-C-D.
If all goes well with this rebase, you'll be left with this:
                      A'-B'-C'-D'   <-- yourbranch (HEAD)
                     /
                 Y--Z   <-- upstream/master
                /
...--o--o--o--W--o--o--o--X   <-- upstream/dev
         \        \        \
          A--B-----M--C--D--N   [abandoned]
You can now create a pull request from this.
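You can check which commits a rebase would pick up before running it, using git rev-list with the same "reachable from this, not from that" logic. The sketch below rebuilds the example graph in a scratch repository. Everything in it is an assumption made for the demo: the commit helper, one file per commit, local branches dev and master standing in for the upstream repository, and remote-tracking names faked with git update-ref instead of a real git fetch.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git symbolic-ref HEAD refs/heads/dev
commit o1
git branch yourbranch                 # your work forks before W
commit W
git branch master                     # master forks at W
commit o2
git checkout -q yourbranch
commit A; commit B
git merge -q --no-edit dev            # merge commit M
git checkout -q dev
commit o3; commit o4; commit X
git checkout -q yourbranch
commit C; commit D
git merge -q --no-edit dev            # merge commit N
git checkout -q master
commit Y; commit Z

# Fake the remote-tracking names used in the text.
git update-ref refs/remotes/upstream/dev "$(git rev-parse dev)"
git update-ref refs/remotes/upstream/master "$(git rev-parse master)"
git checkout -q yourbranch

# Non-merge commits each form of rebase would copy:
with_master=$(git rev-list --no-merges --count upstream/master..yourbranch)
with_dev=$(git rev-list --no-merges --count upstream/dev..yourbranch)
echo "upstream/master..yourbranch: $with_master, upstream/dev..yourbranch: $with_dev"

git rebase -q --onto upstream/master upstream/dev
after=$(git rev-list --count upstream/master..yourbranch)
```

In this toy graph the first count is the eight commits described above and the second is just A-B-C-D, which is what ends up on top of upstream/master after the rebase. In a real repository the copies can hit conflicts, since the changes from upstream/dev that your commits built on are not in upstream/master.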
I want to [end up with] a single commit.
That is, if we draw the desired result, it might look like this:
                      ABCD   <-- yourbranch (HEAD)
                     /
                 Y--Z   <-- upstream/master
                /
...--o--o--o--W--o--o--o--X   <-- upstream/dev
         \        \        \
          A--B-----M--C--D--N   [abandoned]
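where ABCD is a single commit that has the effect you'd get if you first rebased, then did a second rebase to squash them all down to one commit. One caveat: git merge --squash works from the merge base of the current commit and yourbranch (commit W in this drawing), so when you start from upstream/master the new commit also picks up whatever the o-o-o-X commits changed. If the pull request targets upstream/dev, start from upstream/dev instead, so that only your own changes end up in the squashed commit.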
To get there, you can use this sequence of commands:
$ git checkout upstream/master
$ git merge --squash yourbranch
$ git commit # merge --squash only stages the result; this makes the commit
$ git checkout -B yourbranch # or git branch -f yourbranch HEAD
The first git checkout gets you a detached HEAD pointing to the commit identified by upstream/master, i.e., commit Z. If you prefer, you can use a temporary branch name:
$ git checkout -b temp upstream/master
This gives you:
                 Y--Z   <-- upstream/master, HEAD
                /
...--o--o--o--W--o--o--o--X   <-- upstream/dev
         \        \        \
          A--B-----M--C--D--N   <-- yourbranch
or:
                 Y--Z   <-- upstream/master, temp (HEAD)
                /
...--o--o--o--W--o--o--o--X   <-- upstream/dev
         \        \        \
          A--B-----M--C--D--N   <-- yourbranch
The git merge --squash stages the combined changes without committing, and the git commit that follows builds a new non-merge commit with the content you want:
                      ABCD   <-- HEAD
                     /
                 Y--Z   <-- upstream/master
                /
...--o--o--o--W--o--o--o--X   <-- upstream/dev
         \        \        \
          A--B-----M--C--D--N   <-- yourbranch
(or the same drawing with HEAD attached to the name temp, which has now moved to point to ABCD).
The last step is to yank (yoink?) the name yourbranch off commit N and make it point to new commit ABCD, which is where git branch -f or git checkout -B come in. The main difference between these two is whether HEAD is attached to yourbranch afterward:
                      ABCD   <-- HEAD, yourbranch
                     /
                 Y--Z   <-- upstream/master
                /
...--o--o--o--W--o--o--o--X   <-- upstream/dev
         \        \        \
          A--B-----M--C--D--N   [abandoned]
or:
                      ABCD   <-- yourbranch (HEAD)
                     /
                 Y--Z   <-- upstream/master
                /
...--o--o--o--W--o--o--o--X   <-- upstream/dev
         \        \        \
          A--B-----M--C--D--N   [abandoned]
(There are some minor other differences in terms of what ends up in the reflogs for HEAD and yourbranch, but we have not really covered reflogs here.)
(I won't go into how git merge --squash does its work, as this is already pretty long.)
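For completeness, here is the whole squash sequence run end to end on a toy version of the example graph. It is a sketch under several assumptions: the commit helper, one file per commit, local branches dev and master standing in for the upstream repository, and upstream/dev and upstream/master faked with git update-ref rather than fetched. Note the explicit git commit, since git merge --squash stops before committing.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

# Build the example graph (same shape as the drawings above).
git symbolic-ref HEAD refs/heads/dev
commit o1
git branch yourbranch
commit W
git branch master
commit o2
git checkout -q yourbranch
commit A; commit B
git merge -q --no-edit dev            # M
git checkout -q dev
commit o3; commit o4; commit X
git checkout -q yourbranch
commit C; commit D
git merge -q --no-edit dev            # N
git checkout -q master
commit Y; commit Z
git update-ref refs/remotes/upstream/dev "$(git rev-parse dev)"
git update-ref refs/remotes/upstream/master "$(git rev-parse master)"

old_tip=$(git rev-parse yourbranch)   # commit N, in case you want it back

git checkout -q upstream/master       # detached HEAD at Z
git merge -q --squash yourbranch      # stages the combined changes, no commit yet
git commit -q -m "ABCD"               # the single squashed commit
git checkout -q -B yourbranch         # move yourbranch here and attach HEAD

git log --oneline --graph upstream/master yourbranch
```

Afterward yourbranch is exactly one ordinary (non-merge) commit ahead of upstream/master, and the old tip saved in old_tip is still available if you change your mind. In this toy repository the squashed commit also contains the files from the o-o-o-X commits, because the merge base of Z and N is W; starting from upstream/dev instead would leave only the A-B-C-D changes.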
Upvotes: 2