Reputation: 63
My development team recently moved to using GitLab. I suggested that we squash commits on merge requests. I got a lot of pushback that it would not be safe.
A typical feature development cycle includes daily commits and pushes on a feature branch. It also includes once daily git pulls from master to stay in sync with other changes. Sometimes this involves resolving merge conflicts.
They believe all of this merging from master and then performing a merge request to master and squashing commits could result in possible undetected merge problems.
Is there any truth to this? My assumption is that squashing commits should be safe no matter what.
Upvotes: 5
Views: 5802
Reputation: 489163
At the risk of repeating (some or even all of) what is in bk2204's answer, I'll take a stab at this.
(Note that there's a problem with any answer, because you used the word safe. That's not a term with a clear and obvious technical definition. Something totally safe in one situation might be less so in another. Analogies always fall apart when pushed too hard, but consider, e.g., walking down the street without wearing hand, knee, and elbow protection, vs zooming down the street on rollerblades, or should I say "inline skates", without said protection.)
I always like to say that Git is not about files but rather about commits (which then hold files). That holds in this case as well:
- each commit holds a snapshot of files, plus some metadata; and
- the commits made by git merge --squash (or its equivalent clicky button on some web interface) are different from the commits made by git merge (or its equivalent clicky button).
In this case, the entire difference lies in the parent linkage of the final commit. On its face, this might seem like a tiny little difference ... but the way commits link up, in Git, is history, in Git. There's no such thing as file history: there is only commit history. So this small difference is actually huge, especially when you consider how squash-merges should, in general, affect people's future behaviors.
To see how this works in practice, it's necessary to understand the commit graph. This gets us into the metadata I mentioned in the first bullet point above.
I find it helpful, sometimes, to look at the actual, concrete content of a real commit. Here's a commit in the Git repository for Git itself:
$ git rev-parse HEAD
08da6496b61341ec45eac36afcc8f94242763468
$ git cat-file -p 08da6496b61341ec45eac36afcc8f94242763468 | sed 's/@/ /'
tree 27fee9c50528ef1b0960c2c342ff139e36ce2076
parent 07f25ad8b235c2682a36ee76d59734ec3f08c413
author Junio C Hamano <gitster pobox.com> 1570770961 +0900
committer Junio C Hamano <gitster pobox.com> 1570771489 +0900
Eighth batch
Signed-off-by: Junio C Hamano <gitster pobox.com>
(I replaced the @
with space just to maybe, possibly, cut down on the amount of spam that goes to pobox.com. Note that you can actually just git cat-file -p HEAD
, too: I use the raw hash ID here to help cement the idea that it's the hash ID that matters.)
This is actually the full and complete text of the commit. The hash ID, 08da6496b61341ec45eac36afcc8f94242763468
, is a cryptographic checksum of this stuff.1 Git stores the snapshot itself indirectly, as a tree object,2 which has its own hash ID. The remainder of the text is the rest of the metadata: author and committer (name, email address, and date-and-time stamp); parent commits, listed by hash ID; and the log message.
So, each commit has its own unique hash ID. That hash ID refers to that commit; no other commit can ever have the same hash ID.3 Moreover, each commit refers, by hash ID, to its immediate parent commit(s). These parent hash IDs are required to be the IDs of some commits that existed before this commit was made. For ordinary (non-merge) commits, the (single) parent is whichever commit you had checked out at the time you made the commit.
When we look at a new Git repository with just three commits in it, we can draw this out by using single uppercase letters to stand in for the actual hash IDs:
A <-B <-C
Here, commit A
is the first commit we ever made. It has no parent, because the parent is the commit that existed before commit A
, and there isn't one. This special case is what Git calls a root commit.4 But then once we had commit A
, we made commit B
, so B
's parent is A
. Similarly, we were using commit B
when we made commit C
, so C
's parent is B
.
When something holds the hash ID of a commit, we say that that thing points to the commit. So commit C
points to B
, which points to A
, which is a root commit and nothing points there.
In these drawings it's easy to see where the chain ends, but in a real Git repository, with real—and random-looking—hash IDs, it's hard. We need a quick and easy way to find commit C
, at this point. So, partly as a sop to us mere humans who can't remember raw hash IDs, Git provides us with branch names.
A branch name simply holds the hash ID of one commit. In our case, we'll use the name master
to hold the actual hash ID of the commit we're calling C
. We can draw that like this:
A--B--C <-- master
The name master
points to the last commit in the chain, i.e., C
. I've stopped drawing the internal arrows because they get harder to draw soon. Just remember that they always point backwards. They're a consequence of the parent
lines in the actual commits, so they can never change—unlike the branch name arrows, which do change.
Let's add a new commit D
to our chain, by doing git checkout master
and then updating files, git add
-ing, and git commit
-ing. New commit D
will point back to existing commit C
as D
's parent:
A--B--C   <-- master
       \
        D
Then, since D
is the new last commit for the chain, Git will update our current branch name, master
, to point to D
instead of C
. We can also move D
up a line in the drawing. The result is:
A--B--C--D <-- master
To add a branch, we just create a new name, pointing to the current commit. Let's make new branch dev
and have it point to D
:
A--B--C--D <-- dev, master
Now we need one additional bit of notation: we have to remember which branch name we have checked out. We have commit D
out, but when we add new commit E
, only one of the two branch names should move. For this purpose, we'll add (HEAD)
after one of the branch names. That's the name we have out; git commit
will move that name. So if we git checkout dev
, we stay with commit D
—and all of its files—and then when we make the new commit we get:
A--B--C--D   <-- master
          \
           E   <-- dev
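The name-moving behavior described above can be tried in a scratch repository. This is only a sketch: the temporary directory, the demo identity, and the empty commits named A through E are stand-ins chosen for illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Build the chain A--B--C--D on master (empty commits are enough here).
for msg in A B C D; do git commit -q --allow-empty -m "$msg"; done

git branch dev                      # a new name pointing to the same commit D
same_before=$(test "$(git rev-parse master)" = "$(git rev-parse dev)" && echo yes)

git checkout -q dev                 # HEAD is now attached to dev
git commit -q --allow-empty -m E    # only the name dev moves

master_subject=$(git log -1 --format=%s master)
dev_subject=$(git log -1 --format=%s dev)
dev_parent_subject=$(git log -1 --format=%s dev^)
echo "master=$master_subject dev=$dev_subject parent-of-dev=$dev_parent_subject"
```

After the last commit, master should still name D while dev names E, whose parent is D.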
1Technically, it's the SHA-1 checksum of the word commit
followed by a space followed by the size in bytes, expressed in decimal, which is 280
, followed by an ASCII NUL character, followed by the bytes of the content:
$ (printf 'commit 280\0'; git cat-file -p HEAD) | shasum
08da6496b61341ec45eac36afcc8f94242763468 -
This technique works for all four of Git's object types: print the name of the type, a space, the size of the object in bytes as decimalized ASCII, an ASCII NUL, and then the content. For commits, the content is the text of the commit, which is in the order: tree
and hash, each parent in order, author
line, committer
line, any additional header lines, blank line, subject-and-body. For annotated tags, look at any annotated tags. Blob objects have the obvious text. The tricky one is tree objects, which are binary and have several semi-arbitrary rules applied to enforce ordering. Tree objects are going to be the big stumbling block in converting to SHA-256.
2This allows tree objects to be shared: if two different commits save the same snapshot, they don't need to re-save the tree. This sharing applies recursively to sub-trees and, of course, to individual blob objects as well. After all, once an object is written, no part of it can be changed—the hash ID checksum would change and the result would be a different object—so if you have the same content, it reduces to the same hash ID. The object database, which is a simple key-value store, detects that the key is already present and therefore does not bother to store the data again.
3Note that if you make the same commit, with the same tree, same author, same log message, and—crucially—same time stamps and parents, you get the same commit. If any of those are different, including a parent hash or a time stamp, you get a new and different commit. So, if you re-commit the same snapshot, you do it at a different time and with a different parent commit: the new commit is therefore not the same as the old one. Hence the new commit gets a new hash. Even if you use GIT_AUTHOR_DATE
and GIT_COMMITTER_DATE
to force the new commit to share the old commit's time-stamp, the parent hash will be different: the new commit refers back to some other existing commit, different from the original commit's parent(s), so again, re-committing the same snapshot gives you a new and different commit.
The only way to get Git to re-use the commit hash is to commit the same snapshot, at the same time, with the same parents. In this case, it's OK to re-use the original commit, because the new commit is the original commit. They are truly identical. Nobody can tell them apart, so they might as well be the same commit.
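A quick way to see this for yourself is to build the same commit twice with git commit-tree while pinning every input. The identity and time stamps below are arbitrary values picked for the sketch, and the tree is simply the empty tree of a fresh repository.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Pin both time stamps so that every input to the hash is fixed.
export GIT_AUTHOR_DATE='1570770961 +0900' GIT_COMMITTER_DATE='1570771489 +0900'

tree=$(git write-tree)                        # tree object for the empty index
c1=$(git commit-tree -m 'same text' "$tree")
c2=$(git commit-tree -m 'same text' "$tree")  # same bytes, so the same object
# Change one second of the committer time stamp and nothing else.
c3=$(GIT_COMMITTER_DATE='1570771490 +0900' git commit-tree -m 'same text' "$tree")
echo "$c1"; echo "$c2"; echo "$c3"
```

The first two hash IDs should be identical and the third should differ, even though the snapshot and message are the same.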
4You can generate commit graphs that have more than one root commit, but many repositories have just the one original root. It takes one of several slightly odd tricks to acquire an additional root commit.
A merge commit, in Git, is any commit with at least two parents (and usually only two parents). We can't make a merge commit until we have a fork in the graph. For instance, suppose the graph is linear up to commit H
, after which it diverges like this:
          I--J   <-- master (HEAD)
         /
...--G--H
         \
          K--L   <-- dev
We've already done a git checkout master
, which means we have commit J
checked out right now. The name dev
identifies commit L
.
We now run git merge dev
. Git uses the graph to find the point at which the two branches were last brought together, which in this case is obviously commit H
. Git calls this the merge base commit. The other two interesting commits are our current commit, J
, and the one that the name dev
points to, L
.
Git will now, in effect:
- compare the snapshot in the merge base commit H vs our current commit J: this is what we changed since the base;
- compare the snapshot in H vs the other commit L: this is what they changed since the base; and
- combine the two sets of changes.
The combined changes get applied to the contents from commit H
. That way, we get our changes plus their changes. The precise details of the combining process can get pretty tricky, especially in the face of conflicts, but here, we're more interested in the result. We'll assume there were no conflicts.
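The three inputs to the merge can be inspected directly. The sketch below rebuilds the forked graph in a scratch repository; the commit helper function, the one-file-per-commit layout, and the demo identity are assumptions made for illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: each commit adds one file named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b dev;  commit K; commit L
git checkout -q master;  commit I; commit J

base=$(git merge-base master dev)             # the merge base commit
base_subject=$(git log -1 --format=%s "$base")
ours=$(git diff --name-only "$base" master | tr '\n' ' ')   # what we changed
theirs=$(git diff --name-only "$base" dev | tr '\n' ' ')    # what they changed
echo "base=$base_subject ours=$ours theirs=$theirs"
```

Here the base should be H, with I.txt and J.txt on our side and K.txt and L.txt on theirs.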
Since we ran git merge
(not git merge --squash
), Git will now commit the result. The new commit advances our current branch, just like any other commit, but this time the new commit points back to both of the two branch-tip commits. Hence we can draw the result like this:
          I--J
         /    \
...--G--H      M   <-- master (HEAD)
         \    /
          K--L   <-- dev
Commit M
has two parents, J
and L
.
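To check the parent linkage of a merge yourself, something like the following should work in a scratch repository. The commit helper and the file names are made up for the sketch; HEAD^1 and HEAD^2 are Git's standard notation for the first and second parent.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b dev;  commit K; commit L
git checkout -q master;  commit I; commit J
git merge -q --no-edit dev                    # makes merge commit M

# rev-list --parents prints the commit followed by each of its parents.
nparents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
first=$(git log -1 --format=%s 'HEAD^1')      # the commit we were on
second=$(git log -1 --format=%s 'HEAD^2')     # the commit we merged
echo "parents=$nparents first=$first second=$second"
```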
Of course, we can add more every-day commits if we like:
          I--J
         /    \
...--G--H      M--N   <-- master (HEAD)
         \    /
          K--L   <-- dev
New commit N
has M
as its parent, as usual. When we have Git show us the commits that are reachable from master
, Git will start with N
, then move back one step to M
and show us commit M
. Now Git has a choice: should it move back one step to J
, or one step to L
?
In general, Git actually does both of these.5 Neither parent is any "better" or "more important" than the other, so Git will follow both tracks—both legs of the merge—and visit commits J
and I
, but also L
and K
. These tracks re-combine at H
, so once Git reaches H
it can move on to G
and F
and so on as before.
Note that merge commit M
has its own snapshot, just like any other commit. The only thing that is special about M
here is that it has two parents. The history, starting from M
and working backwards, goes down both legs, then rejoins.
One reason this is so important is that a future merge's merge base depends on the current merge. Let's draw a slightly different variant in which dev
gets merged into master
more than once. We'll start with just one merge M
in master
, like this:
...--H---L--M--P   <-- master
      \    /
       I--J--N--O   <-- dev
(Yes, there's no K
, I left it out to make the merge get the letter M
. :-) )
Now we will git checkout master
and git merge dev
to make a new merge Q
. Git will have to find the merge base between commit P
and commit O
. To do that, it works backwards from P
, going to M
and then to both L
and J
. It also works backwards from O
, going to N
and then J
.
Commit J
is reachable from both branches, and is the correct merge base. It is the merge base because M
links back to both L
and J
. So that backwards link going from M
to J
is important here. If we didn't have it, Git would have to keep going further, from L
to H
, and from J
to I
to H
.
In other words, without the link afforded by merge commit M
, the next merge would have to look at a diff from H
to P
(our changes) and from H
to O
(their changes) that includes changes already merged. With the link, Git looks at a diff from J
to P
—which in effect is what we did in L
and P
—and from J
to O
, which in effect is what they did in N
and O
. So the subsequent merge is easier (not trivial, of course, but easier).
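The effect of the back-link on the next merge base can be confirmed with git merge-base. This sketch recreates the graph above in a scratch repository; the helper function and file names are illustrative only.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b dev;  commit I; commit J
git checkout -q master;  commit L
git merge -q --no-edit dev                    # true merge M, parents L and J
commit P
git checkout -q dev;     commit N; commit O

# The second parent of M makes J reachable from master.
base_subject=$(git log -1 --format=%s "$(git merge-base master dev)")
echo "merge base for the next merge: $base_subject"
```

With the true merge in place, the reported merge base should be J rather than H.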
5With git log
and some other Git commands, you can use --first-parent
to tell it to ignore the second parent, and any additional parents if they exist, for the purpose of that one command. Note that the first parent of a merge is chosen at the time the merge is made. It cannot be changed later. It's always the commit you were on when you, or whoever, made the merge. It's up to the viewer to decide whether this first parent is somehow "more important" than any other parent, but the only built-in flag here is --first-parent
: there's no --second-parent
, so either all parents are equal, or first-parent is more important.
Suppose that instead of making merge M
with git merge dev
, we made it with git merge --squash dev
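. Git would still:
- find the merge base commit H;
- compare H vs L to see what we changed;
- compare H vs J to see what they changed; and
- combine the changes, applying them to the snapshot from H.
The command-line git merge --squash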
flag also sets the --no-commit
flag, so that Git forces us to run git commit
to make new commit M
. Most of the clicky buttons on web interfaces go ahead and make the commit themselves, and there's no good reason any more for command-line Git to stop like this, except that backwards compatibility requires it. In any case, when we make the merge—or some clicky button does it—we get:
...--H---L--M   <-- master
      \
       I--J   <-- dev
Everything else is exactly the same, but commit M
has no back-link to dev
.
If we now go on as before and have:
...--H---L--M--P   <-- master
      \
       I--J--N--O   <-- dev
the next merge we run, with or without --squash
, to bring dev
into master
, will compare commit H
to commits P
and O
. If we had to do something clever about merge conflicts when we made M
, we'll probably have to do it again.
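The same check after a squash merge shows the difference. As before, this is a sketch in a scratch repository with a made-up helper and file names; the only change from a true merge is --squash followed by an explicit git commit.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b dev;  commit I; commit J
git checkout -q master;  commit L
git merge -q --squash dev                     # stages the result, no commit yet
git commit -q -m M                            # M gets a single parent, L
squash_parents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
commit P
git checkout -q dev;     commit N; commit O

base_subject=$(git log -1 --format=%s "$(git merge-base master dev)")
echo "parents of M: $squash_parents, next merge base: $base_subject"
```

Because M has no link back to J, the merge base for the next merge should still be H.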
What this means in practice is that after git merge --squash
, you should almost always delete the branch you just merged. That is, having merged commit J
, you should delete the name dev
:
...--H---L--M   <-- master
      \
       I--J   [abandoned]
Commits I
and J
remain in your own repository for a while—exactly how long is kind of complicated—but with no name to find them, you'd have to remember their hash IDs, or at least the hash ID of commit J
.
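As a rough illustration of commits outliving their branch name, the sketch below deletes dev after a squash merge and then finds commit J again by its saved hash ID. The name dev-restored, the helper, and the file names are hypothetical.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b dev;  commit I; commit J
tip=$(git rev-parse dev)                      # remember the hash ID of J
git checkout -q master;  commit L
git merge -q --squash dev && git commit -q -m M

git branch -q -D dev                          # the name is gone...
type=$(git cat-file -t "$tip")                # ...but the object still exists
git branch dev-restored "$tip"                # any new name makes J findable again
restored_subject=$(git log -1 --format=%s dev-restored)
echo "type=$type restored=$restored_subject"
```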
There's a further complication here if anyone has made some other name point to either commit I
or J
or any commit from which we can find J
or I
. For instance, if we have:
...--H---L--M   <-- master
      \
       I--J   <-- dev
        \
         K   <-- d2
and then we delete the name dev
, so that commit J
becomes impossible to find unless you remember its hash ID, we end up with what looks like this:
...--H---L--M   <-- master
      \
       I--K   <-- d2
which now looks like commit I
itself is something important we did while working on d2
, rather than a side effect of starting work on d2
after starting work on dev
and making commit I
.
This is a general rule about Git, not something specific to squash merges: it's the commits that matter, not so much the branch names. A branch name is just a starting—or ending—point we use to find commits. So when you move or delete branch names, if the commits can still be found from some other branch name, they're still in there. Even if they can't be found, they're probably still in your repository somewhere. Your Git will toss them out eventually, later, after deciding that they're unwanted rubbish.
We find commits by starting at branch names and working backwards. When we find some commit this way, we say that it's on the branch, but it may be more accurate to say that it is contained within the branch. Any one commit may be contained in many branches. The set of branches that contain a commit changes when we add, delete, or move the branch names.
So the rule we must remember is this: The commit graph never changes except by adding commits or not being able to find commits. The branch names change all the time—most of the time, in such a way that we find new commits and keep finding the old ones too.
I'm going to skip over most of the complications that come up as we add multiple Git repositories. A git push
just sends commits to some other Git repository, asking them to set one of their branch names to point to the last of these new commits. A git pull
starts by running git fetch
, which gets any new commits from some other Git repository and then updates your own Git's remote-tracking names—origin/master
and the like—to remember the last commit in their corresponding branch. The git merge
you might do after a git fetch
sometimes does a real merge, as described above, to make a new merge commit. Sometimes it does a fast-forward operation, which amounts to checking out their tip commit and dragging your own branch name forward to include all the new commits introduced by this. And of course, git pull
just means run git fetch
, then run a second Git command, usually git merge
.
The complications matter, but the actual merging, if any, always takes place in our own repository.6 We also git fetch
anyone else's commits as needed. We end up holding all the commits locally, and can visualize our Git's commits—even if it got them from some other Git—as just existing in our repository.
A typical feature development cycle includes daily commits and pushes on a feature branch. It also includes once daily git pulls from master to stay in sync with other changes. Sometimes this involves resolving merge conflicts.
So what we end up with here is a graph that looks like this, after several days of work:
...--H--K--L--N--O--R   <-- master
      \     \     \
       I--J--M--P--Q--S   <-- feature
where M
and Q
are real merge commits, some of which might involve resolving merge conflicts.
If we believe feature
is ready now, we can:
git checkout master
git merge feature
which will produce this:
...--H--K--L--N--O--R---T   <-- master
      \     \     \    /
       I--J--M--P--Q--S   <-- feature
The two parents of T
are R
(first parent) and S
(second). We can now delete the name feature
entirely:
...--H--K--L--N--O--R---T   <-- master
      \     \     \    /
       I--J--M--P--Q--S
Commit S
remains reachable by walking back through T
to its second parent. The merge itself is done by comparing the merge base snapshot in commit O
to the snapshots in commits R
and S
and merging those.
Or, we can git merge --squash
, after which we really should delete feature
entirely. The result before deleting feature
looks like this:
...--H--K--L--N--O--R--T   <-- master
      \     \     \
       I--J--M--P--Q--S   <-- feature
The contents of commit T
were made the same way in both cases; it's just that T
has no second parent this time (so it is by definition not a merge commit).
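One way to convince a skeptical team of this is to make both kinds of T from the same starting point and compare their tree hash IDs. The sketch uses a scratch repository; the branch name squashed, the helper, and the file names are invented for the demonstration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b feature; commit I
git checkout -q master;     commit K
git branch squashed                           # second name at the same commit

git merge -q --no-edit feature                # true merge on master
git checkout -q squashed
git merge -q --squash feature && git commit -q -m 'squashed feature'

merge_tree=$(git rev-parse 'master^{tree}')
squash_tree=$(git rev-parse 'squashed^{tree}')
merge_parents=$(( $(git rev-list --parents -n 1 master | wc -w) - 1 ))
squash_parents=$(( $(git rev-list --parents -n 1 squashed | wc -w) - 1 ))
echo "$merge_tree ($merge_parents parents)"
echo "$squash_tree ($squash_parents parents)"
```

The two tree hash IDs should match, while the parent counts should be 2 and 1 respectively.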
Assuming we don't try to remember commits by hash ID and just forget about all the unreachable ones, the result when we delete the name feature
looks like this:
...--H--K--L--N--O--R--T <-- master
6With the clicky web UI buttons, the merge takes place in their Git repository, after which we must git fetch
their merge and then update our own branch names to incorporate the new merge commit. In general, hosting services do not allow merges that have conflicts, as the hosting services don't provide a way to resolve the conflicts. Some hosting services might add such features over time, but I'd prefer to do conflicted merges locally, myself: you have a lot more data available easily this way.
Your question asked about safety, but without a definition of safe it's not really answerable. What we can say is that the work we did in individual commits I
, J
, P
, and S
is no longer findable. Instead, when we look at the snapshots that remain findable, it looks as if someone wrote all the feature
code/changes overnight.
This has good and bad aspects. For some features, the good will outweigh the bad: there's no distracting second parent of merge commit T
to walk through to see individual commits like I
, J
, P
, and S
, and no need to worry about any of the merge commits in that path. But what if there's a bug in the all-in-one changes in T
? Maybe that bug is easy to find in, say, commit J
or commit P
, and hard to find in T
. So maybe it would have been better to retain the individual commits.
If you're sure you never want the original commits back, squashing is the way to go. If you're not sure, it may not be.
There is a compromise position, which in some ways may be the best one of all. Instead of squash-merging, you can rebase the feature branch onto master. What rebase does is to copy commits. Rebasing has its own drawbacks, though.
Here is what we have before merging:
...--H--K--L--N--O--R   <-- master
      \     \     \
       I--J--M--P--Q--S   <-- feature
There are six commits reachable from the name feature
that are not reachable from the name master
: I
, J
, M
, P
, Q
, and S
.
Doing git checkout feature; git rebase master
will direct Git to find those six commits, throw out any merge commits from the list—leaving I-J-P-S
—and then copy those commits, one at a time, so that the new copies come after commit R
. Each copy is done as if by git cherry-pick
.7 Unfortunately, each cherry-pick can have conflicts and if so, you must resolve them. They tend to be the same conflicts that you might have resolved at the now-omitted merges, and will tend to need the same resolution.8 After the copying, Git moves the branch name, so that the end result looks like this:
                      I'-J'-P'-S'   <-- feature
                     /
...--H--K--L--N--O--R   <-- master
      \     \     \
       I--J--M--P--Q--S   [abandoned, but the name ORIG_HEAD works to find S]
The names I'
, J'
, and so on are meant to imply that even though the new commit has a new and different hash ID, it's a copy of the original commit.
You can compare the snapshot in the copy S'
to the snapshot in S
using:
git diff ORIG_HEAD HEAD
If they do not match, you probably made a mistake while re-resolving conflicts.9
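Here is a small, conflict-free version of that check in a scratch repository. The helper and file names are assumptions for the sketch; the feature branch merges master once before being rebased, mimicking the daily pulls from the question.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b feature; commit I
git checkout -q master;     commit K
git checkout -q feature
git merge -q --no-edit master                 # merge M on feature
commit S
old_tip=$(git rev-parse feature)              # hash ID of the original S

git rebase -q master                          # copies I and S after K, omits M
if git diff --quiet ORIG_HEAD HEAD; then same=yes; else same=no; fi
copied=$(( $(git rev-list master..feature | wc -l) ))
merges=$(( $(git rev-list --merges master..feature | wc -l) ))
echo "snapshots match: $same, commits copied: $copied, merges left: $merges"
```

In this simple case the snapshots should match, two commits should have been copied, and no merge commit should remain on feature.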
Now that you have everything neatly rebased, you can do either a fast-forward merge:
                      I'-J'-P'-S'   <-- feature, master
                     /
...--H--K--L--N--O--R
or a true merge:
                      I'-J'-P'-S'   <-- feature
                     /           \
...--H--K--L--N--O--R-------------T   <-- master
and then in either case delete the name feature
whenever it should be deleted.
The big drawback to rebasing is that everyone who was working with feature
must switch to using the new rebased-feature
commit copies. If the feature is ready, and will have its name deleted, that's usually pretty easy. But it definitely falls into the other rule I like to use about Git: Only replace commits with new-and-improved copies if everyone who is using those commits has agreed to move to the new-and-improved copies. This agreement should usually be made in advance.
7Rebase has a bewildering set of options, including -i
, -k
, -m
, and -s
, and some of these options cause it to actually use git cherry-pick
. Other methods use git format-patch
and git am
internally, but are intended to produce the same result.
8There is another Git feature and command, git rerere
with an enable option, that allows Git to do this for you, but always remember that Git is not smart, it's just applying simple text substitution rules. How rerere
works and what can go wrong is rather tricky and I won't go into details here.
9Note that if ORIG_HEAD
gets overwritten by some subsequent operation, you can also use feature@{1}
to find S
. If you're getting into a complicated situation where you need to know the hash ID for S
, either copy-paste it somewhere, or create a temporary branch or tag name to hold it.
In practice, when I'm rebasing a complicated feature, what I like to do is:
git checkout feature
git branch feature.0 # or .1, .2, etc if I already have .0, etc
git rebase master
Now I have the name feature.0
to remember the original branch tip commit.
Upvotes: 13
Reputation: 76784
When Git performs a three-way merge (which is the default style of merge), it considers three points: the merge base (usually, the common ancestor) and the two heads. It doesn't consider any of the commits in between in any way.
So if the state of the files (the root tree) in each of those commits is the same regardless of whether you're squash merging or not squash merging, then the results will be the same either way, and merge conflicts will be no better or worse.
Now, whether it is as easy to look through the history and find the intended behavior when a merge conflict does occur is a different story; since all you have is giant squashed commits on each side, figuring out the right resolution may be more difficult. But the conflicts themselves shouldn't be any different.
Upvotes: 2
Reputation: 30277
No reason to be afraid... lots of people do it with git rebase -i
. I personally follow an alternative path that won't leave stuff behind and that doesn't involve rebasing:
git checkout my-feature-branch
git pull # merge changes from upstream, do _not_ rebase.
# correct conflicts if they show up and finish merge
# after merge is finished/committed the only differences between your branch and upstream are related to _your_ feature and so....
git reset --soft the-upstream-branch # set branch pointer to upstream branch, all differences are set on index ready to be committed
git commit -m "the feature in a single revision"
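# The lines below are a self-contained sketch of the same recipe in a scratch
# repository, so the result can be checked. The helper function, file names,
# and the use of a local master in place of a real upstream are assumptions.
```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b my-feature-branch; commit I; commit J
git checkout -q master;               commit K
git checkout -q my-feature-branch
git merge -q --no-edit master                 # stands in for "git pull" here
merged_tree=$(git rev-parse 'HEAD^{tree}')

git reset -q --soft master                    # move the name; index keeps merged files
git commit -q -m "the feature in a single revision"
final_tree=$(git rev-parse 'HEAD^{tree}')
parent_subject=$(git log -1 --format=%s 'HEAD^')
nparents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
echo "same snapshot: $([ "$merged_tree" = "$final_tree" ] && echo yes), parent: $parent_subject"
```
# The single new commit should carry the merged snapshot and have the
# upstream tip (K in this sketch) as its only parent.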
Upvotes: 1