Interpretation of commit parent-child relationship

Question

Let's say commit A1 is a parent of commit A2. What does it really tell me?

To clarify my question, here are two incorrect interpretations:

1) Commit A2 was created based on commit A1 in the sense that the user checked out A1, made a few edits, and committed A2 (without any intervening git commands). This is wrong due to rebasing.

2) Each git commit stores the delta relative to its parent, so you have to follow the arrows in reverse direction and apply each delta to reconstruct the contents of a commit. This is wrong because unlike many other VCS, git commits store complete snapshots rather than deltas.

Here's an example of an interpretation that seems almost right, but is very vague:

3) Commit A2 incorporates all the work represented by commit A1 plus some additional work. "Work" is used in the simple sense of adding, deleting and editing files.

torek · Accepted Answer

Interpretation 2 is outright wrong, but it contains one correct item: you do (or Git does) have to follow the backwards-arrows that Git stores, in order to construct the graph. Each commit "points to" its parent commits (by storing their true-name hash IDs), making each commit act as a single vertex (or node) plus a set of outgoing arcs that, once collected up, form a directed acyclic graph or DAG. In most diagrams in CS or informatics, we'd have the outgoing arcs go from parents to children, but in Git the arrows are all backwards. (This is so that parents do not need to know their child IDs before the children exist, while also allowing the parent commits to be read-only once created. Since each hash ID is determined solely by each object's contents, and hash IDs are deliberately hard to predict, no hash ID can be known until the contents are known. The parent commits therefore must be read-only: you cannot update them to add their children; that would change their hash IDs.¹)
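To make the backwards arrows concrete, here is a minimal sketch in Python (not Git's own code). It models a commit graph as a dictionary from each commit to its list of parents and walks those links to collect ancestors. The commit names and the `ancestors` / `children` helpers are made up for illustration; real Git uses hash IDs and its own traversal machinery.

```python
# Toy commit graph: each commit records only its parents (backwards arrows).
# The names are hypothetical stand-ins for real hash IDs.
PARENTS = {
    "A1": [],            # root commit: no parents
    "A2": ["A1"],
    "B1": ["A1"],
    "M": ["A2", "B1"],   # merge commit: two parents
}


def ancestors(commit, parents=PARENTS):
    """Return every commit reachable from `commit` by following parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def children(commit, parents=PARENTS):
    """Children are not stored on the parent; finding them means scanning every commit."""
    return {name for name, ps in parents.items() if commit in ps}
```

Note the asymmetry: `ancestors` only needs the commits it visits, while `children` has to look at the whole graph, because no parent ever records who its children are.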

Interpretation 1 is mostly correct, but is missing some key items. As Jim Deville said in his answer, Git's various plumbing commands allow you to construct nearly-arbitrary commit graph nodes (i.e., commit objects). The command git commit-tree in particular takes any number of valid parent commit IDs (-p options), one valid tree ID, and a commit message, and constructs a new commit from these, using your configuration and your computer's idea of the current time to set the author and committer name, email, and timestamp fields (or using the environment variable overrides if they are set). The new commit object is stored in the database with nothing pointing to it, so you must quickly² set a reference (such as a branch or tag name) to retain it. (Or, you can create another commit to retain the just-created commit, but then that commit requires either a name, or another commit which requires something, and so on.)
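As a rough illustration of what a commit-tree-style command assembles, the Python sketch below builds the text of a commit object from a tree ID, a list of parent IDs, and a message, then hashes it the way SHA-1 Git repositories name objects. It is a simplification under stated assumptions: no signatures or encoding headers, a fixed timestamp, and a made-up identity string. `commit_body` and `git_object_id` are hypothetical helper names, not Git APIs.

```python
import hashlib


def git_object_id(kind, body):
    """Name an object as SHA-1 Git does: hash of '<kind> <size>', a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, message, ident, when="0 +0000"):
    """Assemble a simplified commit object; any number of parents is allowed."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {ident} {when}")
    lines.append(f"committer {ident} {when}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


EMPTY_TREE = git_object_id("tree", b"")       # the ID of a tree with no entries
IDENT = "A U Thor <author@example.com>"       # hypothetical identity

root = git_object_id("commit", commit_body(EMPTY_TREE, [], "root", IDENT))
child = git_object_id("commit", commit_body(EMPTY_TREE, [root], "child", IDENT))
# Same tree, message, and identity as `child`, but no parent: a different ID.
orphan = git_object_id("commit", commit_body(EMPTY_TREE, [], "child", IDENT))
```

Because the parent lines are part of the hashed text, `child` and `orphan` get different IDs even though everything else matches. That is also why a parent cannot be edited later to list its children.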

This means that the parent information is up to the command that creates the commit.

When you use git rebase, the step that creates the new commit is usually—or might as well be—git commit itself, and git commit sets the new commit's parent based on the result of reading HEAD (and then immediately updates HEAD or, more normally, the branch that HEAD names). A rebase operation generally works with a "detached HEAD", where HEAD contains the raw hash ID of an existing commit, instead of the more normal case of HEAD containing a branch-name.

Hence, rebase works by detaching HEAD so that it points to the --onto target (which defaults to the <upstream> argument), then making commits, one at a time. It makes each new commit by converting the original commit into a delta, applying the delta to the current index-and-work-tree, and making a commit a la git commit. (The actual mechanics of rebase are implemented using either git cherry-pick or git am, both of which are written in C and use the code from git commit. However, an interactive rebase may, in some cases, such as for squash steps or when using --root, literally run git commit rather than, or in addition to, running git cherry-pick. A --preserve-merges rebase uses the interactive machinery and literally runs git merge to create new merges. The details get fairly complicated.)

Note that the conversion, from snapshot to changeset / delta, is done by running git diff against the commit's recorded parent. Hence, setting a weird parent ID is not useful. You can do it (with git commit-tree) but unless you will never cherry-pick or rebase or git show the commit, all of which use the parent ID to change snapshot to delta, this would be poor planning.
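The replay described above can be sketched with a toy model in Python. Snapshots are plain dictionaries from path to content, each commit is a `(parent, snapshot)` pair, and `diff`, `apply_delta`, and `replay` are hypothetical helpers written for this example. There is no conflict handling and no three-way merge, so this is only an outline of the idea, not how Git's cherry-pick is actually implemented.

```python
def diff(old, new):
    """Delta turning snapshot `old` into `new`: path -> new content, or None for a deletion."""
    delta = {}
    for path in old.keys() | new.keys():
        if old.get(path) != new.get(path):
            delta[path] = new.get(path)
    return delta


def apply_delta(snapshot, delta):
    """Return a new snapshot with `delta` applied (no conflict detection)."""
    result = dict(snapshot)
    for path, content in delta.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def replay(commits, to_copy, onto):
    """Copy each commit in `to_copy` (oldest first) on top of `onto`; return the new tip."""
    head = onto                                   # "detached HEAD" starts at the target
    for name in to_copy:
        parent, snapshot = commits[name]
        base = commits[parent][1] if parent is not None else {}
        delta = diff(base, snapshot)              # snapshot -> delta via the recorded parent
        new_name = name + "'"
        commits[new_name] = (head, apply_delta(commits[head][1], delta))
        head = new_name                           # the new commit's parent came from HEAD
    return head


commits = {
    "A1": (None, {"f": "1"}),
    "A2": ("A1", {"f": "1", "g": "x"}),
    "B1": ("A1", {"f": "1", "h": "y"}),
}
tip = replay(commits, ["A2"], "B1")               # creates A2' with parent B1
normal = diff(commits["A1"][1], commits["A2"][1]) # delta against the real parent
weird = diff({}, commits["A2"][1])                # delta if A2 had recorded no parent
```

Here `normal` contains only the added file `g`, while `weird` also claims that `f` was added. A commit with an odd recorded parent produces a correspondingly odd delta whenever it is shown, cherry-picked, or rebased.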

¹One could, of course, split each commit object into a read-only portion that participates in the hashing, and a read/write portion that does not. That would allow Git to add child IDs to parents. But this would make Git less stable and less secure: read-only objects tend not to get corrupted as much as read/write objects, and having part of a commit not participate in its hash would mean that that part was not protected by the hash either.

²By default, git gc --auto, which other Git commands run from time to time, gives you two weeks to finish this task. If it takes you longer than that, an automatic git gc may prune away your as-yet-unreferenced commit.
