How is squashing different from just taking the last commit?

Question

Explanations of squashing usually say the process incorporates the changes of a commit into previous commit(s), resulting in a single commit.

However, I am very confused about what this is actually supposed to mean, because commits do not represent deltas but complete versions of the project.

Let's say I have four commits

A <-- B <-- C <-- D

and I check out D and interactively rebase on A, squashing B, C, and D into a single commit BCD.

The result is:

A <-- BCD

My question is how the tree above is different from

A <-- D

because, no matter what examples I tried, the working directory of BCD always looked just like D. I would be grateful for an example where A <-- BCD and A <-- D differ.

CLARIFICATION

It seems my question caused some confusion, so here is an alternative wording:

If I opened commit D in the .git/objects folder with an editor, changed the parent pointer from C to A, and then deleted the commits B and C from the .git/objects folder, I would get

A <-- D

Question: is the tree above identical to A <-- BCD, the tree I get by squashing B, C and D in an interactive rebase?

(And if the trees are identical, would I have arrived at the same result by picking "drop" for B and C in the interactive rebase instead of squashing?)

torek · Accepted Answer

As matt said in his answer, the trick here is that most of the commands we use for managing the commits wind up turning the snapshots into change-sets. When they deal with commits as change-sets they really do have to do this sort of replaying, even though the smart way (as noted below) wouldn't.
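To see the change-set behavior concretely, here is a minimal sketch in a throwaway repository. It assumes each commit only adds one file (the file names and the branch name dropped are made up for illustration), and it emulates "drop B, drop C, pick D" by cherry-picking D directly onto A, which is what the rebase does for the one remaining pick.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Build A <- B <- C <- D; each commit adds one file and gets a tag of the same name.
for name in A B C D; do
    echo "$name" > "file-$name.txt"
    git add "file-$name.txt"
    git commit -q -m "$name"
    git tag "$name"
done

# Replay only D's change-set (the difference C -> D) on top of A.
git checkout -q -b dropped A
git cherry-pick D >/dev/null

tree_d=$(git rev-parse 'D^{tree}')
tree_dropped=$(git rev-parse 'dropped^{tree}')
files_dropped=$(git ls-tree --name-only dropped | tr '\n' ' ')

echo "tree of D:            $tree_d"
echo "tree after dropping:  $tree_dropped"
echo "files after dropping: $files_dropped"
```

In this setup the replayed commit contains only file-A.txt and file-D.txt, so its tree is not D's tree. If D's change had touched lines introduced by B or C, the cherry-pick would typically stop with a conflict instead.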

Longer

Remember that every commit has a unique hash ID, which will only ever refer to that particular commit. No part of any commit can ever be changed, including the back-pointer to its parent: if we try to change any part of any commit, we end up with a new, different commit, with a new, unique hash ID. So, given the original sequence:

A <-B <-C <-D   <--branchname

we're going to end up with:

  B <-C <-D   [abandoned]
 /
A <-E   <--branchname

no matter what else happens. We can have E be the equivalent of the squash of BCD, or we can have E be different in some way: perhaps we retain D's tree but use a different commit message than any of B, C, or D; and of course E points directly back to A, which is unlike C and D.

Using git rebase -i and replacing two picks with squashes uses a somewhat inefficient method of arriving at exactly that situation: it builds up, as a temporary commit that gets shoved aside, a combined BC commit with a combined message, and then readies (but doesn't quite commit yet) a combined BCD commit—one whose tree will match that of commit D—and a combined message. It then invokes your editor on the combined message.

If we replace the two picks with fixups, Git still builds the combined BC commit but uses B's message, then makes as our E commit the combined BCD commit using B's message.
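The squash route can be scripted to check the claim about the resulting tree. This is only a sketch: it assumes a throwaway repository with one added file per commit, and it uses GIT_SEQUENCE_EDITOR with sed to rewrite the todo list and GIT_EDITOR=true to accept the combined message unedited, in place of doing those two steps by hand.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Build A <- B <- C <- D; each commit adds one file and gets a tag of the same name.
for name in A B C D; do
    echo "$name" > "file-$name.txt"
    git add "file-$name.txt"
    git commit -q -m "$name"
    git tag "$name"
done

# Keep "pick" for B (line 1 of the todo list); turn the picks for C and D into "squash".
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,\$s/^pick/squash/'" \
GIT_EDITOR=true \
git rebase -i A >/dev/null 2>&1

squashed_tree=$(git rev-parse 'HEAD^{tree}')
d_tree=$(git rev-parse 'D^{tree}')
squashed_parent=$(git rev-parse 'HEAD^')
a_commit=$(git rev-parse 'A^{commit}')

echo "tree of squashed commit:   $squashed_tree"
echo "tree of D:                 $d_tree"
echo "parent of squashed commit: $squashed_parent"
echo "commit A:                  $a_commit"
```

The two tree IDs should match, and the squashed commit's parent should be A, even though its commit hash differs from D's.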

The efficient way to handle this would be to make a single E commit that uses D's tree and whatever message we want. Git could be taught to do this, but that's a special case that just falls out of the easier add-one-at-a-time method. (It's possible that since the rewrite of rebase into C, Git actually has been taught to do this. I have not looked into the inner workings lately; the last time I did, the shell-script-based rebase definitely made separate commits.)
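For comparison, the plumbing command git commit-tree can build such a commit in one step from D's tree. This is a sketch of the idea, not what git rebase runs internally; the repository, file names, commit message, and the branch name squashed are assumptions for illustration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Build A <- B <- C <- D; each commit adds one file and gets a tag of the same name.
for name in A B C D; do
    echo "$name" > "file-$name.txt"
    git add "file-$name.txt"
    git commit -q -m "$name"
    git tag "$name"
done

# Create commit E directly: D's snapshot, A as the only parent, a message of our choosing.
e_commit=$(git commit-tree 'D^{tree}' -p A -m "Squash of B, C, and D")
git branch squashed "$e_commit"

e_tree=$(git rev-parse 'squashed^{tree}')
d_tree=$(git rev-parse 'D^{tree}')
e_parent=$(git rev-parse 'squashed^')
a_commit=$(git rev-parse 'A^{commit}')

echo "tree of E: $e_tree"
echo "tree of D: $d_tree"
echo "parent of E: $e_parent (A is $a_commit)"
```

No intermediate BC commit is created here; E simply reuses the tree object that D already points to.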

You can also run git merge --squash, which ends up doing this more cleverly, but to do that you need to assign a branch name to point to commit A:

A   <-- branch1 (HEAD)
 \
  B--C--D   <-- branch2

With branch1 checked out as shown, running git merge --squash branch2 && git commit will produce:

A---------E   <-- branch1 (HEAD)
 \
  B--C--D   <-- branch2

without the excess compute-work that the rebase method might use. But computer time is usually pretty cheap, and this requires more human-time to set up the multiple branch names. (You need the && git commit because --squash always turns on --no-commit.)
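A minimal sketch of that squash-merge, run in a throwaway repository. The helper add_commit, the file names, and the commit message are assumptions for illustration; branch1 and branch2 match the drawings.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: add one file and commit it under the given name.
add_commit() {
    echo "$1" > "file-$1.txt"
    git add "file-$1.txt"
    git commit -q -m "$1"
}

add_commit A
git branch -M branch1            # A   <-- branch1
git checkout -q -b branch2
add_commit B
add_commit C
add_commit D                     # B--C--D   <-- branch2

# --squash stops before committing, so the commit is made explicitly.
git checkout -q branch1
git merge --squash branch2 >/dev/null && git commit -q -m "E: squash of B, C, D"

e_tree=$(git rev-parse 'HEAD^{tree}')
d_tree=$(git rev-parse 'branch2^{tree}')
e_parents=$(git cat-file -p HEAD | grep -c '^parent ')

echo "tree of E: $e_tree"
echo "tree of D: $d_tree"
echo "number of parents of E: $e_parents"
```

E ends up with D's tree and a single parent, A; branch2 still points to D.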

Compare the squash-merge result to a regular merge (made with git merge --no-ff here, since a plain git merge would simply fast-forward branch1 to D):

A---------E   <-- branch1 (HEAD)
 \       /
  B--C--D   <-- branch2

The difference is that a regular merge records two parents for new commit E, and a squash-merge doesn't. The lack of extra compute work occurs because there is no commit after A on branch1; Git realizes that moving from A to D is a fast-forward operation, i.e., that it means just use D's snapshot. Had we started with:

A--E--F   <-- branch1 (HEAD)
 \
  B--C--D   <-- branch2

there would have been real work to do, for either squash-merge or real-merge, and Git would have done that work to produce a new merge, or non-merge, commit G:

A--E--F---G   <-- branch1 (HEAD)
 \       ?
  B--C--D   <-- branch2

(where ? means that there's an arrow back from G to D for the real-merge case, but not for the squash-merge case).
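The diverged case can be checked the same way. In this sketch the helper add_commit, the file names, and the scratch branches squash-result and merge-result are assumptions for illustration, so that both kinds of G can be built from the same starting point and compared.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: add one file and commit it under the given name.
add_commit() {
    echo "$1" > "file-$1.txt"
    git add "file-$1.txt"
    git commit -q -m "$1"
}

add_commit A
git branch -M branch1
git checkout -q -b branch2
add_commit B
add_commit C
add_commit D                     # A--B--C--D   <-- branch2
git checkout -q branch1
add_commit E
add_commit F                     # A--E--F      <-- branch1

# G as a squash-merge: one parent (F).
git checkout -q -b squash-result branch1
git merge --squash branch2 >/dev/null
git commit -q -m "G (squash-merge)"

# G as a real merge: two parents (F and D).
git checkout -q -b merge-result branch1
git merge --no-ff -m "G (real merge)" branch2 >/dev/null

squash_tree=$(git rev-parse 'squash-result^{tree}')
merge_tree=$(git rev-parse 'merge-result^{tree}')
squash_parents=$(git cat-file -p squash-result | grep -c '^parent ')
merge_parents=$(git cat-file -p merge-result | grep -c '^parent ')

echo "squash-merge G: tree $squash_tree, parents $squash_parents"
echo "real-merge G:   tree $merge_tree, parents $merge_parents"
```

Both versions of G hold the same snapshot; only the recorded parents differ.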
