I have heard several times that Git keeps an internal record of the "changes it knows about", and that this is often why we need to back-merge changes (say, a hotfix on staging to dev), or why rebasing has inherent problems. I have the feeling this is something like "a first class object": the phrase is a bit ambiguous and used loosely, and the underlying reality is more straightforward.
So what exactly does "changes it knows about" mean? Where in the Git source code does each branch maintain a list of sha1's that it "knows about" (maybe here: https://github.com/git/git/blob/master/commit-graph.c somewhere)? And what happens when Git changes a sha1 - for example, say somebody on the release branch does a squash commit of a,b,c to make d?
Does the release branch know about a,b,c and on a merge back to dev would it also pass that information or just the presence of d? What if dev branch already had commit a - would it be smart enough to manage the presence of d and a simultaneously?
Since this is long, let me answer the last part first, then clear up the other bits:
... for example, say somebody on the release branch does a squash commit of a,b,c to make d. Does the release branch know about a,b,c and on a merge back to dev would it also pass that information or just the presence of d?
A squash, whether obtained via git rebase -i or git merge --squash, is a new and different commit. In most cases, there is no easy way to tell that this new commit d is equivalent to a+b+c. Actions that involve copying commits or their effects, i.e., that copy commit d or its effects into dev, or a+b+c into release, may see merge conflicts, but may not! It depends on both the action and the changes.
I have heard several times that git keeps an internal record of the "changes it knows about".
This is not even true in a vague sense. What Git does is to compute change-sets as needed. Understanding this is one of the keys to using Git. In particular we need to understand how Git will compute change-sets when you ask it to perform the action of merge (to merge, or merge-as-a-verb, as I like to put it).
Git stores a commit graph, in which each commit links back to its immediate-predecessor, i.e., parent, commit, or for merge commits, two or more such predecessors. Each commit itself also stores, indirectly, a snapshot—not changes, but rather, whole snapshots.
The precise details are not really important at this level, but for concreteness, what's in the main Git database is a set of objects, each being one of four types: commit, tree (which allows Git to locate files within commits), blob (which mostly stores a file's content), and annotated tag (exclusively for annotated tags). Git as a whole is a series of databases: this main one, which is a simple key-value store indexed by hash IDs, plus a bunch of auxiliary ones. The one crucial auxiliary database is another key-value store that turns names into hash IDs.
To some extent, "change" vs "snapshot" is irrelevant, and to some extent it isn't. To use an algebraic example (which will fall apart if you push too hard on it, but should get the point across): suppose I tell you that today, it is 5 degrees warmer or colder than it was yesterday. Now I ask you: what is today's temperature? You can't tell from this alone! If I tell you that it was 16°C yesterday, now you can tell, because now you have a snapshot value to go with the 5° delta. On the other hand, if I say that one day is 16°C and the next day is 21°C, you can find the delta from the two snapshots.
In short, given the snapshots in commits, you (or Git) can produce deltas, but given only deltas, you cannot produce snapshots. Meanwhile, using the linkages, from commits to their parents, you can also produce the commit graph: a graph is defined mathematically as a pair of sets G = (V, E) where V is the vertex-set and E is the edge-set. Git stores the individual edges in the nodes that represent the vertices. Those nodes are the commit objects in the object database.
The auxiliary database I mentioned above comes into play right here. To gain entry into the graph, Git needs some set of starting-point hash IDs. There is another option, which maintenance commands like git fsck and git gc use: they simply find every object in the main database. But this is too slow for normal work, and would make it much harder to discard unwanted objects, so Git has that auxiliary name-to-hash-ID database: a branch name like master turns into one, and only one, hash ID. For a branch name, this particular hash ID locates a commit, which Git then calls the tip of the branch.
No object in the main database can ever be changed. This means no commit ever changes, not by a single bit. No files ever change either: to change a file, we just store a copy of the new version, as a new blob object. The old commit, which holds the old file, links to the old blob. The new commit, with the new file, links to the new blob. In general (there are specific exceptions) we only ever add things to the main database. To add a new commit to a branch, we write out the new commit, which links to all of its files and also to its parent commit. This gives us a new hash ID, which we then store into the branch name.
The effect is that the branch name always stores only the last hash ID for the branch. We use that to find the commit, and then use the commit's parent hash ID to find the previous commit, and so on.
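As a rough illustration of the last few paragraphs, here is a toy model in Python. It is only a sketch: plain dictionaries stand in for Git's object and name databases, and the commit IDs ("c1", "c2", "c3"), the helper names, and the file contents are all made up for the example. Real Git uses hash IDs and stores trees and blobs rather than a dictionary per commit.

```python
# Toy model only: plain dicts stand in for Git's two key-value databases.
commits = {}   # commit ID -> {"parent": ID or None, "snapshot": {path: content}}
refs = {}      # branch name -> ID of the tip commit

def commit(branch, commit_id, snapshot):
    """Write a new commit whose parent is the current tip, then move the name."""
    commits[commit_id] = {"parent": refs.get(branch), "snapshot": dict(snapshot)}
    refs[branch] = commit_id  # only the name moves; existing commits are untouched

def history(branch):
    """Start at the tip and follow parent links backwards."""
    ids, current = [], refs.get(branch)
    while current is not None:
        ids.append(current)
        current = commits[current]["parent"]
    return ids

def delta(old, new):
    """Compute a change-set from two snapshots, on demand."""
    return {path: (old.get(path), new.get(path))
            for path in sorted(set(old) | set(new))
            if old.get(path) != new.get(path)}

commit("master", "c1", {"README": "hello\n"})
commit("master", "c2", {"README": "hello\n", "main.c": "int main(){}\n"})
commit("master", "c3", {"README": "hello, world\n", "main.c": "int main(){}\n"})

print(history("master"))   # tip first, then each parent in turn
print(delta(commits["c2"]["snapshot"], commits["c3"]["snapshot"]))
```

Nothing in the model stores a change: the delta between c2 and c3 is derived from the two snapshots whenever it is asked for.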
... And what happens when git changes a sha1
Git never changes a hash ID. While the hash IDs are currently SHA-1, there is a migration plan to switch to another hash algorithm, and it's not really necessary to assume SHA-1, so let's just call these "hash IDs" or "object IDs", as Git is starting to do internally. The TLA for this is OID, so let's use OID here. The OID for any object is simply a checksum of the object's content, including the type-header that Git sticks on the front.[1] The OID hash algorithm must be quite good to prevent hash collisions (see How does the newly found SHA-1 collision affect Git?). Every commit has a time-stamp, which helps ensure that each commit gets a unique OID.[2]
[1] This is required so that a commit object and a blob object have different checksums, even if you extract the contents of the commit (with or without the header) and store it as a blob. These two objects need to have distinct hash IDs, otherwise the blob could not be stored!
[2] The time stamp has a one-second granularity, so if you make two 100% identical commits on two different branches all within one second, you get two names pointing to the same commit. The effect is that you have "fast-forward merged" the two branches. However, the branches have to have started out merged to achieve this result, so it's actually OK, in a technical sense; it's just surprising. ("Fast-forward merge" is kind of a misnomer, too, but this is already a footnote, so let me stop here....)
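To make the "checksum of the content plus a type-header" idea concrete, here is a minimal Python sketch of how a SHA-1 object ID can be computed. The function name object_id is my own; the header layout (type, a space, the decimal size, a NUL byte) is the one Git uses for loose objects.

```python
import hashlib

def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over the header '<type> <size>' plus a NUL byte, then the raw content."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

content = b"hello world\n"

# Should match what `git hash-object` reports for a file with this content.
print(object_id("blob", content))

# The well-known ID of the empty blob: e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(object_id("blob", b""))

# Same bytes, different type header, therefore a different ID.
print(object_id("blob", content) != object_id("commit", content))
```

Because the type is part of the hashed data, the same bytes stored as a blob and as a commit get different IDs, which is the point of the first footnote.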
At the core of all modern version control systems (and even many older ones), we have algorithms for doing the so-called three way merge. To do this, we need to turn snapshots into change-sets. See also Why is a 3-way merge advantageous over a 2-way merge? and especially VonC's answer, which illustrates the three-way merge for a single file.
What's clever about Git (though some other modern VCSes do it too now) is that it automatically finds the correct merge base snapshot, using the graph. If we draw the graph, we can see how this works. To merge feature into mainline, we run git checkout mainline to attach HEAD to it and make L (whatever its actual hash ID is) the current commit:
...--o--o--B---o--L   <-- mainline (HEAD)
            \
             o--o--R   <-- feature
We then run git merge feature to select commit R to be merged. Git now uses the commit graph to find the best common ancestor commit, which becomes our merge base B.
Git now turns commit L, a snapshot, into a change-set to be applied to commit B:
git diff --find-renames <hash-of-B> <hash-of-L> # what we changed
It does the same with B-vs-R:
git diff --find-renames <hash-of-B> <hash-of-R> # what they changed
Having computed these two change-sets, Git can now combine the change-sets as shown in VonC's answer, one file at a time, applying the combined changes to the snapshot in B. The result, assuming that all goes well anyway, is a new snapshot (let's call it M, for merge) that we commit as usual, making it become the tip of the current branch. What's special about M is that it links back to both L and R:
...--o--o--B---o--L--M   <-- mainline (HEAD)
            \       /
             o--o--R   <-- feature
No existing commit changes (this being impossible), but mainline now locates commit M, which has as its snapshot the result of merging (as a verb) the changes in both L and R, with respect to the merge base B. Commit M is a merge commit (merge being an adjective here) or even just "a merge", with merge being a noun, because it has two parent commits, L and R.
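The combining step can be sketched in a few lines of Python. This is a deliberately simplified model, not Git's algorithm: it treats each file as an indivisible unit, whereas Git also merges changes within a file. The function name and the sample snapshots standing in for B, L, and R are invented for the illustration.

```python
def three_way_merge(base, ours, theirs):
    """Merge three snapshots ({path: content}); a missing path means 'no such file'."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree, or neither touched the file
            result = o
        elif o == b:      # only they changed it: take theirs
            result = t
        elif t == b:      # only we changed it: take ours
            result = o
        else:             # both changed it, in different ways
            conflicts.append(path)
            continue
        if result is not None:   # None here means the file ends up deleted
            merged[path] = result
    return merged, conflicts

B = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
L = {"a.txt": "2", "b.txt": "1", "c.txt": "L"}                  # what we have
R = {"a.txt": "1", "b.txt": "3", "c.txt": "R", "d.txt": "new"}  # what they have

merged, conflicts = three_way_merge(B, L, R)
print(merged)
print(conflicts)
```

Each change relative to B is taken once: our edit to a.txt, their edit to b.txt, and their new d.txt all land in the result, while c.txt, which both sides changed differently, is reported as a conflict and left out of the merged snapshot.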
Note that if we continue development on feature and eventually run another git merge, the merge base this time is not commit B, but rather our original commit R. Let's see how that is:
...--o--o--B---o--L--M--o--T   <-- mainline (HEAD)
            \       /
             o--o--R--o--o--U   <-- feature
To find the best common ancestor, Git starts at the two tip commits (now T and U respectively) and works backwards, following the backwards-looking links. T goes back to a boring commit o and then to M, and from M to both L and R. U goes back through two boring o commits to R. We can keep going back and find B too, but R is closer to the ends, so it is the new merge base.
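Here is a small Python sketch of that search on the graph drawn above. It is an illustration under stated assumptions, not Git's implementation: the graph is a hand-written dictionary of parent links, the unnamed o commits get invented labels (l1, r1, r2, t1, u1, u2), and B is treated as the root because the earlier history does not matter here. "Best" is taken to mean a common ancestor that is not itself an ancestor of another common ancestor.

```python
def ancestors(parents, tip):
    """Every commit reachable from tip by following parent links, tip included."""
    seen, stack = set(), [tip]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen

def merge_bases(parents, a, b):
    """Common ancestors that are not ancestors of some other common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}

# commit -> list of parent commits, as in the drawing
parents = {
    "B": [],
    "l1": ["B"], "L": ["l1"],
    "r1": ["B"], "r2": ["r1"], "R": ["r2"],
    "M": ["L", "R"],                     # the merge commit has two parents
    "t1": ["M"], "T": ["t1"],
    "u1": ["R"], "u2": ["u1"], "U": ["u2"],
}

print(merge_bases(parents, "L", "R"))   # the first merge
print(merge_bases(parents, "T", "U"))   # the second merge
```

For the first merge the only common ancestor is B. For the second, B, r1, r2, and R are all common ancestors, but only R is not an ancestor of any of the others, so R is the merge base.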
To make a squash merge (as in git merge --squash), Git does the same merge-as-a-verb step as before, getting two diffs and combining the change-sets. But now, instead of making merge commit M, Git makes[3] a single-parent, ordinary commit S:
...--o--o--B---o--L--S   <-- mainline (HEAD)
            \
             o--o--R   <-- feature
Since commit S only links back to L, not to R, it's impossible to tell from the graph alone that S is the result of a merge. The effect is that feature, as a branch, should probably be killed off: struck from our drawing, with the three commits on that branch allowed to fade away and eventually be removed (via the maintenance, and very slow, git gc operation, which Git does automatically in the background whenever it seems appropriate).
If we don't kill off feature, and instead continue developing and then do another merge operation (squash or not), we get:
...--o--o--B---o--L--S--o--T   <-- mainline (HEAD)
            \
             o--o--R--o--o--U   <-- feature
The merge base this time is still B, so Git compares B-vs-T to see what we did, and B-vs-U to see what they did. Since "we" made all their changes in S, those changes will definitely overlap. But the idea behind the three-way merge is to take each change once. If it's still clear that we took the changes without also changing them more, we'll be OK! It's when we or they seem to have changed the existing changes some more that we will get a merge conflict, because as far as Git can tell, the changes in T now clash with those in U. When we did a real merge, the merge base was R, not B, so we see far fewer clashing changes.
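The difference between the two histories comes down to one parent link, which a short Python sketch can show. As before this is a toy model: the graphs are hand-written dictionaries, the o commits carry invented labels, and B is treated as the root.

```python
def ancestors(parents, tip):
    """Every commit reachable from tip by following parent links, tip included."""
    seen, stack = set(), [tip]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen

# Links shared by both drawings.
shared = {
    "B": [], "l1": ["B"], "L": ["l1"],
    "r1": ["B"], "r2": ["r1"], "R": ["r2"],
    "u1": ["R"], "u2": ["u1"], "U": ["u2"],
}
with_merge = {**shared, "M": ["L", "R"], "t1": ["M"], "T": ["t1"]}   # real merge M
with_squash = {**shared, "S": ["L"], "t1": ["S"], "T": ["t1"]}       # squash commit S

for name, graph in (("merge", with_merge), ("squash", with_squash)):
    common = ancestors(graph, "T") & ancestors(graph, "U")
    print(name, sorted(common))
```

With the real merge, R is reachable from T through M's second parent, so R is a common ancestor and becomes the merge base. With the squash, S has only L as a parent, R is not reachable from T, and the only common ancestor left is B.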
[3] For no particularly good reason, --squash always turns on --no-commit, so that git merge does not make the commit itself. You must run git commit manually to finish the job. (I believe this is an artifact of the original implementation. This stop-after-verb-part behavior really should be removed since you can run git merge --squash --no-commit now, but this would change the command's observable behavior, and the Git folks do not like to do that.)
The fundamental idea behind cherry-pick is to copy some set of changes. To do that, we must, as usual, turn a commit (a snapshot) into a change-set. This means using git diff, just as we did with merge. Suppose, for instance, we have the following branch names and commit graph fragment:
...--o--o--H   <-- us (HEAD)
         \
          o--o--B--C   <-- them
Here H is our HEAD commit, and we'd like to cherry-pick commit C from branch them. We simply run git cherry-pick them, and behind the scenes, Git runs:
git diff --find-renames <hash-of-B> <hash-of-C> # what they changed
The way Git finds commit B is trivial: it's the parent of C!
Having found these changes, Git needs to apply them to our snapshot. It could just try to apply them directly,[4] but it turns out to work better to do a full-blown three-way merge-as-a-verb into H using B as the merge base, so that's what Git does. Once the merge-as-a-verb is done, Git makes an ordinary (non-merge) commit that has the same change as C, but applied to H because it's combined with the B-vs-H change-set.
The result looks a lot like squash "merge" since the action is essentially the same one. However, since Git by default copies the commit message (and author!) as well, we can call the new commit C' to indicate that it's a copy of C:
...--o--o--H--C'   <-- us (HEAD)
         \
          o--o--B--C   <-- them
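A cherry-pick can be modeled as the same three-way merge with the picked commit's parent used as the base. The Python sketch below makes that concrete under simplifying assumptions: each file is merged as a whole (Git also merges within a file), a conflict simply raises an error, and the function names, snapshots, and file names are invented for the example.

```python
def merge_file(base, ours, theirs):
    """Whole-file three-way merge; None means the file does not exist."""
    if ours == theirs or theirs == base:
        return ours            # they changed nothing here, or we agree
    if ours == base:
        return theirs          # only they changed it
    raise ValueError("conflict")

def cherry_pick(snapshots, parent_of, head, picked):
    """Apply the change parent(picked) -> picked on top of head."""
    base = snapshots[parent_of[picked]]      # B, the parent of C
    ours, theirs = snapshots[head], snapshots[picked]
    result = {}
    for path in sorted(set(base) | set(ours) | set(theirs)):
        content = merge_file(base.get(path), ours.get(path), theirs.get(path))
        if content is not None:
            result[path] = content
    return result

snapshots = {
    "H": {"app.py": "v1", "docs.md": "ours"},
    "B": {"app.py": "v1", "docs.md": "base", "extra.py": "x"},
    "C": {"app.py": "v2", "docs.md": "base", "extra.py": "x"},
}
parent_of = {"C": "B"}

print(cherry_pick(snapshots, parent_of, "H", "C"))
```

Only the B-vs-C change (app.py going from v1 to v2) is carried over. The file extra.py, which earlier commits on them introduced, is not copied, and our own version of docs.md is kept.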
As with squash "merge", repeatedly cherry-picking commits from one branch into another sets you up for potential merge conflicts later, if the change-set that got merged is "touched" by a later commit.
[4] In fact, both git cherry-pick and git rebase did exactly that at one time, back in the distant past (Git 1.5-ish). As noted above, though, a real three-way merge works better in general, so cherry-pick now uses three-way merge. Meanwhile git rebase optionally uses three-way merges: git rebase -i literally re-uses the cherry-pick code, and git rebase -m runs three-way merges, but some cases of old non-interactive git rebase still use git format-patch along with git apply.
Git stores snapshots, and computes change-sets on the fly from those snapshots.
Git stores the commits as a graph—specifically, a Directed Acyclic Graph, which has certain nice properties mathematically—and uses the graph to find merge bases from which to compute change-sets.
Git uses branch names to identify particular commits, which it calls tip commits, within the graph. The name always points to whichever commit is to be considered the last commit that is contained within the branch. Since the graph itself has places that diverge (branch) and rejoin (merge), commits often belong to more than one branch at a time. The set of branches that contain a commit is constantly changing, as branch names are added and removed. The graph itself remains constant! The names are merely pointers, pointing into the graph.
Although it's not covered in this answer, being reachable from some name is crucial to every commit. The maintenance git gc operation will, eventually, remove any commit that is unreachable (from any name: branch, tag, or other reference) from the database. For much more on reachability, see Think Like (a) Git.