Reputation: 63

Is Merge Request Squash Commit Option Safe

My development team recently moved to using GitLab. I suggested that we squash commits on merge requests. I got a lot of pushback that it would not be safe.

A typical feature development cycle includes daily commits and pushes on a feature branch. It also includes once daily git pulls from master to stay in sync with other changes. Sometimes this involves resolving merge conflicts.

They believe all of this merging from master and then performing a merge request to master and squashing commits could result in possible undetected merge problems.

Is there any truth to this? My assumption is that squashing commits should be safe no matter what.

Upvotes: 5

Answers (3)

torek

Reputation: 489163

At the risk of repeating (some or even all of) what is in bk2204's answer, I'll take a stab at this.

(Note that there's a problem with any answer, because you used the word safe. That's not a term with a clear and obvious technical definition. Something totally safe in one situation might be less so in another. Analogies always fall apart when pushed too hard, but consider, e.g., walking down the street without wearing hand, knee, and elbow protection, vs zooming down the street on rollerblades (or should I say "inline skates"?) without said protection.)

I always like to say that Git is not about files but rather about commits (which then hold files). That holds in this case as well:

Each commit has a snapshot of (all of your) files, plus some metadata.
Merging simply combines snapshots—typically 3 of them—according to a defined set of rules.
Regular merge and squash-merge use the same mechanism to achieve the final set of files that are to go into the next commit.
But, the commits made by git merge --squash (or its equivalent clicky button on some web interface) are different from the commits made by git merge (or its equivalent clicky button).

In this case, the entire difference lies in the parent linkage of the final commit. On its face, this might seem like a tiny little difference ... but the way commits link up, in Git, is history, in Git. There's no such thing as file history: there is only commit history. So this small difference is actually huge, especially when you consider how squash-merges should, in general, affect people's future behaviors.

To see how this works in practice, it's necessary to understand the commit graph. This gets us into the metadata I mentioned in the first bullet point above.

Visualizing the commit graph

I find it helpful, sometimes, to look at the actual, concrete content of a real commit. Here's a commit in the Git repository for Git itself:

$ git rev-parse HEAD
08da6496b61341ec45eac36afcc8f94242763468
$ git cat-file -p 08da6496b61341ec45eac36afcc8f94242763468 | sed 's/@/ /'
tree 27fee9c50528ef1b0960c2c342ff139e36ce2076
parent 07f25ad8b235c2682a36ee76d59734ec3f08c413
author Junio C Hamano <gitster pobox.com> 1570770961 +0900
committer Junio C Hamano <gitster pobox.com> 1570771489 +0900

Eighth batch

Signed-off-by: Junio C Hamano <gitster pobox.com>

(I replaced the @ with space just to maybe, possibly, cut down on the amount of spam that goes to pobox.com. Note that you can actually just git cat-file -p HEAD, too: I use the raw hash ID here to help cement the idea that it's the hash ID that matters.)

This is actually the full and complete text of the commit. The hash ID, 08da6496b61341ec45eac36afcc8f94242763468, is a cryptographic checksum of this stuff.¹ Git stores the snapshot itself indirectly, as a tree object,² which has its own hash ID. The remainder of the text is the rest of the metadata: author and committer (name, email address, and date-and-time stamp); parent commits, listed by hash ID; and the log message.

So, each commit has its own unique hash ID. That hash ID refers to that commit; no other commit can ever have the same hash ID.³ Moreover, each commit refers, by hash ID, to its immediate parent commit(s). These parent hash IDs are required to be the IDs of some commits that existed before this commit was made. For ordinary (non-merge) commits, the (single) parent is whichever commit you had checked out at the time you made the commit.

When we look at a new Git repository with just three commits in it, we can draw this out by using single uppercase letters to stand in for the actual hash IDs:

A <-B <-C

Here, commit A is the first commit we ever made. It has no parent, because the parent is the commit that existed before commit A, and there isn't one. This special case is what Git calls a root commit.⁴ But then once we had commit A, we made commit B, so B's parent is A. Similarly, we were using commit B when we made commit C, so C's parent is B.

When something holds the hash ID of a commit, we say that that thing points to the commit. So commit C points to B, which points to A, which is a root commit and nothing points there.

In these drawings it's easy to see where the chain ends, but in a real Git repository, with real—and random-looking—hash IDs, it's hard. We need a quick and easy way to find commit C, at this point. So, partly as a sop to us mere humans who can't remember raw hash IDs, Git provides us with branch names.

A branch name simply holds the hash ID of one commit. In our case, we'll use the name master to hold the actual hash ID of the commit we're calling C. We can draw that like this:

A--B--C   <-- master

The name master points to the last commit in the chain, i.e., C. I've stopped drawing the internal arrows because they get harder to draw soon. Just remember that they always point backwards. They're a consequence of the parent lines in the actual commits, so they can never change—unlike the branch name arrows, which do change.

Let's add a new commit D to our chain, by doing git checkout master and then updating files, git add-ing, and git commit-ing. New commit D will point back to existing commit C as D's parent:

A--B--C   <-- master
       \
        D

Then, since D is the new last commit for the chain, Git will update our current branch name, master, to point to D instead of C. We can also move D up a line in the drawing. The result is:

A--B--C--D   <-- master

To add a branch, we just create a new name, pointing to the current commit. Let's make new branch dev and have it point to D:

A--B--C--D   <-- dev, master

Now we need one additional bit of notation: we have to remember which branch name we have checked out. We have commit D out, but when we add new commit E, only one of the two branch names should move. For this purpose, we'll add (HEAD) after one of the branch names. That's the name we have out; git commit will move that name. So if we git checkout dev, we stay with commit D—and all of its files—and then when we make the new commit we get:

A--B--C--D   <-- master
          \
           E   <-- dev (HEAD)
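Here is a minimal sketch of the drawings above, run in a throwaway repository. The temporary directory, the demo identity, and the `commit` helper are made up for illustration; only the Git commands themselves are real.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit, so nothing ever conflicts.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

for name in A B C D; do commit "$name"; done
D=$(git rev-parse master)     # the name master holds the hash ID of commit D

git branch dev                # a second name for the same commit
git checkout -q dev           # HEAD is now attached to dev
commit E                      # only the checked-out name moves

git log --oneline --graph --decorate --all
echo "master:        $(git rev-parse master)"
echo "parent of dev: $(git rev-parse dev^)"
```

The last two lines print the same hash: master still names D, and E's parent is D.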

¹Technically, it's the SHA-1 checksum of the word commit followed by a space followed by the size in bytes, expressed in decimal, which is 280, followed by an ASCII NUL character, followed by the bytes of the content:

$ (printf 'commit 280\0'; git cat-file -p HEAD) | shasum
08da6496b61341ec45eac36afcc8f94242763468  -

This technique works for all four of Git's object types: print the name of the type, a space, the size of the object in bytes as decimalized ASCII, an ASCII NUL, and then the content. For commits, the content is the text of the commit, which is in the order: tree and hash, each parent in order, author line, committer line, any additional header lines, blank line, subject-and-body. For annotated tags, look at any annotated tags. Blob objects have the obvious text. The tricky one is tree objects, which are binary and have several semi-arbitrary rules applied to enforce ordering. Tree objects are going to be the big stumbling block in converting to SHA-256.
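As a rough check of the same technique on a blob, the sketch below hashes a six-byte file by hand and compares the result with git hash-object. It assumes Git's default SHA-1 object format and that either sha1sum or shasum is installed; the temporary file is just for the demo.

```shell
set -e
tmp=$(mktemp)
printf 'hello\n' > "$tmp"
size=$(wc -c < "$tmp" | tr -d ' ')     # object size in bytes, as decimal text

# Use whichever SHA-1 tool is available (an assumption about the machine).
if command -v sha1sum >/dev/null 2>&1; then sha1=sha1sum; else sha1=shasum; fi

# type, space, size, NUL, then the content
manual=$( (printf 'blob %s\0' "$size"; cat "$tmp") | $sha1 | cut -d' ' -f1 )
viagit=$(git hash-object "$tmp")

echo "by hand: $manual"
echo "via git: $viagit"
```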

²This allows tree objects to be shared: if two different commits save the same snapshot, they don't need to re-save the tree. This sharing applies recursively to sub-trees and, of course, to individual blob objects as well. After all, once an object is written, no part of it can be changed—the hash ID checksum would change and the result would be a different object—so if you have the same content, it reduces to the same hash ID. The object database, which is a simple key-value store, detects that the key is already present and therefore does not bother to store the data again.

³Note that if you make the same commit, with the same tree, same author, same log message, and—crucially—same time stamps and parents, you get the same commit. If any of those are different, including a parent hash or a time stamp, you get a new and different commit. So, if you re-commit the same snapshot, you do it at a different time and with a different parent commit: the new commit is therefore not the same as the old one. Hence the new commit gets a new hash. Even if you use GIT_AUTHOR_DATE and GIT_COMMITTER_DATE to force the new commit to share the old commit's time-stamp, the parent hash will be different: the new commit refers back to some other existing commit, different from the original commit's parent(s), so again, re-committing the same snapshot gives you a new and different commit.

The only way to get Git to re-use the commit hash is to commit the same snapshot, at the same time, with the same parents. In this case, it's OK to re-use the original commit, because the new commit is the original commit. They are truly identical. Nobody can tell them apart, so they might as well be the same commit.
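You can see this for yourself with the plumbing command git commit-tree, which builds a commit object from explicit inputs. This is only a sketch: the identity and the time stamps are arbitrary values picked for the demo, and the snapshot is the empty tree of a fresh repository.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_DATE='1577836800 +0000' GIT_COMMITTER_DATE='1577836800 +0000'

tree=$(git write-tree)                               # snapshot of the (empty) index
first=$(git commit-tree -m 'same message' "$tree")   # no -p, so this is a root commit
second=$(git commit-tree -m 'same message' "$tree")  # identical inputs
# Change one input, the committer time stamp, by a single second:
later=$(GIT_COMMITTER_DATE='1577836801 +0000' git commit-tree -m 'same message' "$tree")

echo "first:  $first"
echo "second: $second"
echo "later:  $later"
```

The first two hashes are identical because every input is identical; the third differs.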

⁴You can generate commit graphs that have more than one root commit, but many repositories have just the one original root. It takes one of several slightly odd tricks to acquire an additional root commit.

Merge commits and the commit graph

A merge commit, in Git, is any commit with at least two parents (and usually only two parents). We can't make a merge commit until we have a fork in the graph. For instance, suppose the graph is linear up to commit H, after which it diverges like this:

          I--J   <-- master (HEAD)
         /
...--G--H
         \
          K--L   <-- dev

We've already done a git checkout master, which means we have commit J checked out right now. The name dev identifies commit L.

We now run git merge dev. Git uses the graph to find the point at which the two branches were last brought together, which in this case is obviously commit H. Git calls this the merge base commit. The other two interesting commits are our current commit, J, and the one that the name dev points to, L.

Git will now, in effect:

diff the contents of the merge base H commit vs our current commit J: this is what we changed since the base;
diff H vs the other commit L: this is what they changed since the base;
do its best to combine these two sets of changes.

The combined changes get applied to the contents from commit H. That way, we get our changes plus their changes. The precise details of the combining process can get pretty tricky, especially in the face of conflicts, but here, we're more interested in the result. We'll assume there were no conflicts.

Since we ran git merge (not git merge --squash), Git will now commit the result. The new commit advances our current branch, just like any other commit, but this time the new commit points back to both of the two branch-tip commits. Hence we can draw the result like this:

          I--J
         /    \
...--G--H      M   <-- master (HEAD)
         \    /
          K--L   <-- dev

Commit M has two parents, J and L.
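A small sketch of this merge in a scratch repository follows. The directory, the demo identity, and the `commit` helper are invented for the example, and each commit touches its own file so the merge cannot conflict.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H; H=$(git rev-parse HEAD)
git branch dev
commit I; commit J; J=$(git rev-parse HEAD)    # master: H--I--J
git checkout -q dev
commit K; commit L; L=$(git rev-parse HEAD)    # dev:    H--K--L
git checkout -q master

base=$(git merge-base master dev)   # where the two branches last met: H
git merge -q -m M dev               # a true merge, which creates commit M

# Prints M's hash followed by the hashes of its parents, J then L.
git rev-list --parents -n 1 HEAD
```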

Of course, we can add more every-day commits if we like:

          I--J
         /    \
...--G--H      M--N   <-- master (HEAD)
         \    /
          K--L   <-- dev

New commit N has M as its parent, as usual. When we have Git show us the commits that are reachable from master, Git will start with N, then move back one step to M and show us commit M. Now Git has a choice: should it move back one step to J, or one step to L?

In general, Git actually does both of these.⁵ Neither parent is any "better" or "more important" than the other, so Git will follow both tracks—both legs of the merge—and visit commits J and I, but also L and K. These tracks re-combine at H, so once Git reaches H it can move on to G and F and so on as before.

Note that merge commit M has its own snapshot, just like any other commit. The only thing that is special about M here is that it has two parents. The history, starting from M and working backwards, goes down both legs, then rejoins.

One reason this is so important is that a future merge's merge base depends on the current merge. Let's draw a slightly different variant in which dev gets merged into master more than once. We'll start with just one merge M in master, like this:

...--H---L--M--P   <-- master
      \    /
       I--J--N--O   <-- dev

(Yes, there's no K, I left it out to make the merge get the letter M. :-) )

Now we will git checkout master and git merge dev to make a new merge Q. Git will have to find the merge base between commit P and commit O. To do that, it works backwards from P, going to M and then to both L and J. It also works backwards from O, going to N and then J.

Commit J is reachable from both branches, and is the correct merge base. It is the merge base because M links back to both L and J. So that backwards link going from M to J is important here. If we didn't have it, Git would have to keep going further, from L to H, and from J to I to H.

In other words, without the link afforded by merge commit M, the next merge would have to look at a diff from H to P (our changes) and from H to O (their changes) that includes changes already merged. With the link, Git looks at a diff from J to P—which in effect is what we did in L and P—and from J to O, which in effect is what they did in N and O. So the subsequent merge is easier (not trivial, of course, but easier).
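To check the merge base claim, here is a sketch that builds the same shape of graph and asks Git directly. As before, the scratch directory, identity, and `commit` helper are assumptions made for the demo.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b dev;  commit I; commit J
J=$(git rev-parse HEAD)
git checkout -q master;  commit L
git merge -q -m M dev                # true merge: M has parents L and J
commit P
git checkout -q dev;     commit N; commit O
git checkout -q master

# Because M links back to J, the base for the next merge is J, not H.
git merge-base master dev
echo "J is $J"
```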

⁵With git log and some other Git commands, you can use --first-parent to tell it to ignore the second parent, and any additional parents if they exist, for the purpose of that one command. Note that the first parent of a merge is chosen at the time the merge is made. It cannot be changed later. It's always the commit you were on when you, or whoever, made the merge. It's up to the viewer to decide whether this first parent is somehow "more important" than any other parent, but the only built-in flag here is --first-parent: there's no --second-parent, so either all parents are equal, or first-parent is more important.
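A quick sketch of what --first-parent does to a history walk, using a scratch repository with one merge (the directory, identity, and `commit` helper are made up for the demo):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git branch dev
commit I; commit J                  # master: H--I--J
git checkout -q dev
commit K; commit L                  # dev:    H--K--L
git checkout -q master
git merge -q -m M dev               # M: first parent J, second parent L

all=$(git rev-list --count HEAD)                     # walks both legs
first=$(git rev-list --count --first-parent HEAD)    # follows M, J, I, H only
echo "all parents: $all commits; first parent only: $first commits"
git log --oneline --first-parent
```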

A squash merge is not a merge

Suppose that instead of making merge M with git merge dev, we made it with git merge --squash dev. Git would still:

find merge base H;
diff H vs L to see what we changed;
diff H vs J to see what they changed; and
combine the two diffs, applying the combination to the snapshot from H.

The command-line git merge --squash flag also sets the --no-commit flag, so that Git forces us to run git commit to make new commit M. Most of the clicky buttons on web interfaces go ahead and make the commit themselves, and there's no good reason any more for command-line Git to stop like this, except that backwards compatibility requires it. In any case, when we make the merge—or some clicky button does it—we get:

...--H---L--M   <-- master
      \
       I--J   <-- dev

Everything else is exactly the same, but commit M has no back-link to dev.

If we now go on as before and have:

...--H---L--M--P   <-- master
      \
       I--J--N--O   <-- dev

the next merge we run, with or without --squash, to bring dev into master, will compare commit H to commits P and O. If we had to do something clever about merge conflicts when we made M, we'll probably have to do it again.
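The sketch below repeats the earlier experiment with --squash, to show that the merge base stays at H. The scratch directory, identity, and `commit` helper are again invented for the demo.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H; H=$(git rev-parse HEAD)
git checkout -q -b dev;  commit I; commit J
git checkout -q master;  commit L
git merge -q --squash dev            # stages the combined result, makes no commit
git commit -q -m M                   # M is an ordinary commit whose only parent is L
commit P
git checkout -q dev;     commit N; commit O
git checkout -q master

# M's hash plus exactly one parent hash: two words in total.
parents=$(git rev-list --parents -n 1 master~1 | wc -w | tr -d ' ')
echo "words on M's line: $parents"
# No link from M back to J, so the next merge starts over from H.
git merge-base master dev
```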

What this means in practice is that after git merge --squash, you should almost always delete the branch you just merged. That is, having merged commit J, you should delete the name dev:

...--H---L--M   <-- master
      \
       I--J   [abandoned]

Commits I and J remain in your own repository for a while—exactly how long is kind of complicated—but with no name to find them, you'd have to remember their hash IDs, or at least the hash ID of commit J.
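One practical wrinkle, sketched below: since the squash commit has no link back to J, git branch -d regards dev as unmerged and refuses, so you need -D to delete the name. The commit object is still there afterwards if you kept its hash. Scratch directory, identity, and `commit` helper are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b dev;  commit I; commit J
J=$(git rev-parse dev)
git checkout -q master;  commit L
git merge -q --squash dev && git commit -q -m M

# J is not reachable from master, so the safe delete is refused.
if git branch -d dev 2>/dev/null; then refused=no; else refused=yes; fi
echo "git branch -d refused: $refused"

git branch -q -D dev                 # force-delete the name
kind=$(git cat-file -t "$J")         # the object itself has not gone anywhere yet
echo "object $J is still a $kind"
```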

Complications

There's a further complication here if anyone has made some other name point to either commit I or commit J, or to any commit from which we can find J or I. For instance, if we have:

...--H---L--M   <-- master
      \
       I--J   <-- dev
        \
         K   <-- d2

and then we delete the name dev, so that commit J becomes impossible to find unless you remember its hash ID, we end up with what looks like this:

...--H---L--M   <-- master
      \
       I--K   <-- d2

which now looks like commit I itself is something important we did while working on d2, rather than a side effect of starting work on d2 after starting work on dev and making commit I.

This is a general rule about Git, not something specific to squash merges: it's the commits that matter, not so much the branch names. A branch name is just a starting—or ending—point we use to find commits. So when you move or delete branch names, if the commits can still be found from some other branch name, they're still in there. Even if they can't be found, they're probably still in your repository somewhere. Your Git will toss them out eventually, later, after deciding that they're unwanted rubbish.

We find commits by starting at branch names and working backwards. When we find some commit this way, we say that it's on the branch, but it may be more accurate to say that it is contained within the branch. Any one commit may be contained in many branches. The set of branches that contain a commit changes when we add, delete, or move the branch names.
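git branch --contains reports exactly this set. The sketch below shows the set for commit I shrinking when the name dev is deleted. Scratch directory, identity, and `commit` helper are made up for the demo; the --format option assumes a reasonably recent Git.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b dev;  commit I
I=$(git rev-parse HEAD)
git branch d2                        # d2 starts at I as well
commit J                             # dev: H--I--J
git checkout -q d2;      commit K    # d2:  H--I--K

before=$(git branch --format='%(refname:short)' --contains "$I" | xargs)
git branch -q -D dev                 # J is now unreachable, I is not
after=$(git branch --format='%(refname:short)' --contains "$I" | xargs)

echo "branches containing I before: $before"
echo "branches containing I after:  $after"
```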

So the rule we must remember is this: The commit graph never changes except by adding commits or not being able to find commits. The branch names change all the time—most of the time, in such a way that we find new commits and keep finding the old ones too.

With all this in mind, let's look at your particular situation

I'm going to skip over most of the complications that come up as we add multiple Git repositories. A git push just sends commits to some other Git repository, asking them to set one of their branch names to point to the last of these new commits. A git pull starts by running git fetch, which gets any new commits from some other Git repository and then updates your own Git's remote-tracking names—origin/master and the like—to remember the last commit in their corresponding branch. The git merge you might do after a git fetch sometimes does a real merge, as described above, to make a new merge commit. Sometimes it does a fast-forward operation, which amounts to checking out their tip commit and dragging your own branch name forward to include all the new commits introduced by this. And of course, git pull just means run git fetch, then run a second Git command, usually git merge.
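As a sketch of the two steps that git pull combines, the following uses two scratch repositories on the local disk. The names upstream and mine, the identity, and the `commit` helper are invented for the demo; here the merge step happens to be a fast-forward.

```shell
set -e
work=$(mktemp -d) && cd "$work"
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q upstream && cd upstream
git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
commit A
cd "$work" && git clone -q upstream mine

(cd upstream && commit B)            # someone else adds a commit over there

cd mine
git fetch -q origin                  # step 1: get new commits, update origin/master
behind=$(git rev-list --count master..origin/master)
echo "master is behind origin/master by $behind commit(s)"
git merge -q --ff-only origin/master # step 2: here, just drag master forward
git log --oneline --decorate -n 2
```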

The complications matter, but the actual merging, if any, always takes place in our own repository.⁶ We also git fetch anyone else's commits as needed. We end up holding all the commits locally, and can visualize our Git's commits—even if it got them from some other Git—as just existing in our repository.

A typical feature development cycle includes daily commits and pushes on a feature branch. It also includes once daily git pulls from master to stay in sync with other changes. Sometimes this involves resolving merge conflicts.

So what we end up with here is a graph that looks like this, after several days of work:

...--H--K--L--N--O--R   <-- master
      \     \     \
       I--J--M--P--Q--S   <-- feature

where M and Q are real merge commits, some of which might involve resolving merge conflicts.

If we believe feature is ready now, we can:

git checkout master
git merge feature

which will produce this:

...--H--K--L--N--O--R---T   <-- master
      \     \     \    /
       I--J--M--P--Q--S   <-- feature

The two parents of T are R (first parent) and S (second). We can now delete the name feature entirely:

...--H--K--L--N--O--R---T   <-- master
      \     \     \    /
       I--J--M--P--Q--S

Commit S remains reachable by walking back through T to its second parent. The merge itself is done by comparing the merge base snapshot in commit O to the snapshots in commits R and S and merging those.

Or, we can git merge --squash, after which we really should delete feature entirely. The result before deleting feature looks like this:

...--H--K--L--N--O--R--T   <-- master
      \     \     \
       I--J--M--P--Q--S   <-- feature

The contents of commit T were made the same way in both cases; it's just that T has no second parent this time (so it is by definition not a merge commit).
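This is easy to verify in a scratch repository: do the true merge and the squash merge from the same starting point and compare the tree hashes. The sketch uses a smaller graph than the drawing, and the branch names try-merge and try-squash, the identity, and the `commit` helper are all made up for the demo.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b feature;  commit I
git checkout -q master;      commit K
git checkout -q feature;     git merge -q -m M master; commit P
git checkout -q master;      commit R

git checkout -q -b try-merge master
git merge -q -m T feature                              # true merge
git checkout -q -b try-squash master
git merge -q --squash feature && git commit -q -m T    # squash "merge"

merge_tree=$(git rev-parse 'try-merge^{tree}')
squash_tree=$(git rev-parse 'try-squash^{tree}')
echo "merge tree:  $merge_tree"
echo "squash tree: $squash_tree"
# Hash of T plus its parents: three words for the merge, two for the squash.
merge_words=$(git rev-list --parents -n 1 try-merge | wc -w | tr -d ' ')
squash_words=$(git rev-list --parents -n 1 try-squash | wc -w | tr -d ' ')
```

The two tree hashes are the same; only the parent lists differ.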

Assuming we don't try to remember commits by hash ID and just forget about all the unreachable ones, the result when we delete the name feature looks like this:

...--H--K--L--N--O--R--T   <-- master

⁶With the clicky web UI buttons, the merge takes place in their Git repository, after which we must git fetch their merge and then update our own branch names to incorporate the new merge commit. In general, hosting services do not allow merges that have conflicts, as the hosting services don't provide a way to resolve the conflicts. Some hosting services might add such features over time, but I'd prefer to do conflicted merges locally, myself: you have a lot more data available easily this way.

Conclusion

Your question asked about safety, but without a definition of safe it's not really answerable. What we can say is that the work we did in individual commits I, J, P, and S is no longer find-able. Instead, when we look at the snapshots that remain findable, it looks as if someone wrote all the feature code/changes overnight.

This has good and bad aspects. For some features, the good will outweigh the bad: there's no distracting second parent of merge commit T to walk through to see individual commits like I, J, P, and S, and no need to worry about any of the merge commits in that path. But what if there's a bug in the all-in-one changes in T? Maybe that bug is easy to find in, say, commit J or commit P, and hard to find in T. So maybe it would have been better to retain the individual commits.

If you're sure you never want the original commits back, squashing is the way to go. If you're not sure, it may not be.

There is a compromise position, which in some ways may be the best one of all. Instead of squash-merging, you can rebase the feature branch onto master. What rebase does is to copy commits. Rebasing has its own drawbacks, though.

Here is what we have before merging:

...--H--K--L--N--O--R   <-- master
      \     \     \
       I--J--M--P--Q--S   <-- feature

There are six commits reachable from the name feature that are not reachable from the name master: I, J, M, P, Q, and S.

Doing git checkout feature; git rebase master will direct Git to find those six commits, throw out any merge commits from the list—leaving I-J-P-S—and then copy those commits, one at a time, so that the new copies come after commit R. Each copy is done as if by git cherry-pick.⁷ Unfortunately, each cherry-pick can have conflicts and if so, you must resolve them. They tend to be the same conflicts that you might have resolved at the now-omitted merges, and will tend to need the same resolution.⁸ After the copying, Git moves the branch name, so that the end result looks like this:

                      I'-J'-P'-S'  <-- feature
                     /
...--H--K--L--N--O--R   <-- master
      \     \     \
       I--J--M--P--Q--S   [abandoned, but the name ORIG_HEAD works to find S]

The names I', J', and so on are meant to imply that even though the new commit has a new and different hash ID, it's a copy of the original commit.

You can compare the snapshot in the copy S' to the snapshot in S using:

git diff ORIG_HEAD HEAD

If they differ by anything other than what master gained after your last merge from it (here, the changes in commit R), you probably made a mistake while re-resolving conflicts.⁹

Now that you have everything neatly rebased, you can do either a fast-forward merge:

                      I'-J'-P'-S'  <-- feature, master
                     /
...--H--K--L--N--O--R

or a true merge:

                      I'-J'-P'-S'  <-- feature
                     /          \
...--H--K--L--N--O--R------------T   <-- master

and then in either case delete the name feature whenever it should be deleted.

The big drawback to rebasing is that everyone who was working with feature must switch to using the new rebased-feature commit copies. If the feature is ready, and will have its name deleted, that's usually pretty easy. But it definitely falls into the other rule I like to use about Git: Only replace commits with new-and-improved copies if everyone who is using those commits has agreed to move to the new-and-improved copies. This agreement should usually be made in advance.

⁷Rebase has a bewildering set of options, including -i, -k, -m, and -s, and some of these options cause it to actually use git cherry-pick. Other methods use git format-patch and git am internally, but are intended to produce the same result.

⁸There is another Git feature and command, git rerere (turned on with the rerere.enabled configuration setting), that allows Git to do this for you, but always remember that Git is not smart: it's just applying simple text substitution rules. How rerere works and what can go wrong is rather tricky and I won't go into details here.

⁹Note that if ORIG_HEAD gets overwritten by some subsequent operation, you can also use feature@{1} to find S. If you're getting into a complicated situation where you need to know the hash ID for S, either copy-paste it somewhere, or create a temporary branch or tag name to hold it.

In practice, when I'm rebasing a complicated feature, what I like to do is:

git checkout feature
git branch feature.0    # or .1, .2, etc if I already have .0, etc
git rebase master

Now I have the name feature.0 to remember the original branch tip commit.
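Here is the whole sequence as a sketch in a scratch repository. To keep it short, master has no commits after the last merge into feature, so the rebased tip should have exactly the same snapshot as the saved one. Scratch directory, identity, and `commit` helper are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b feature;  commit I
git checkout -q master;      commit K
git checkout -q feature;     git merge -q -m M master; commit P

git branch feature.0                 # remembers the original tip
git rebase -q master                 # copies I and P after K, drops merge M

merges=$(git rev-list --count --merges master..feature)
copies=$(git rev-list --count master..feature)
echo "merge commits left on feature: $merges; copied commits: $copies"
git diff --stat feature.0 feature    # prints nothing: same snapshot, new commits
git log --oneline --graph --all
```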

Upvotes: 13

bk2204

Reputation: 76784

When Git performs a three-way merge (which is the default style of merge), it considers three points: the merge base (usually, the common ancestor) and the two heads. It doesn't consider any of the commits in between in any way.

So if the state of the files (the root tree) in each of those commits is the same regardless of whether you're squash merging or not squash merging, then the results will be the same either way, and merge conflicts will be no better or worse.

Now, whether it is as easy to look through the history and find the intended behavior when a merge conflict does occur is a different story; since all you have is giant squashed commits on each side, figuring out the right resolution may be more difficult. But the conflicts themselves shouldn't be any different.

Upvotes: 2

eftshift0

Reputation: 30277

No reason to be afraid... lots of people do it with git rebase -i. I personally follow an alternative path that won't leave stuff behind and that doesn't involve rebasing:

git checkout my-feature-branch
git pull # merge changes from upstream, do _not_ rebase.
# correct conflicts if they show up and finish merge
# after merge is finished/committed the only differences between your branch and upstream are related to _your_ feature and so....
git reset --soft the-upstream-branch # set branch pointer to upstream branch, all differences are set on index ready to be committed
git commit -m "the feature in a single revision"
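A sketch of that recipe in a scratch repository, with a local git merge master standing in for git pull and master standing in for the-upstream-branch. The directory, identity, and `commit` helper are invented for the demo.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
git config user.name demo && git config user.email demo@example.com
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit H
git checkout -q -b my-feature-branch;  commit I; commit J
git checkout -q master;                commit K
git checkout -q my-feature-branch
git merge -q -m 'merge upstream' master       # stand-in for git pull
merged_tree=$(git rev-parse 'HEAD^{tree}')

git reset -q --soft master    # move the branch name to master; index keeps the merged snapshot
git commit -q -m "the feature in a single revision"

git log --oneline --graph
echo "tree before: $merged_tree"
echo "tree after:  $(git rev-parse 'HEAD^{tree}')"
```

The single new commit sits directly on top of master and carries the same snapshot the merged branch had.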

Upvotes: 1