How does Git reflect changes from its objects onto the file system?

Question

I've gone through the Git Internals book and mostly understand how Git structures things into blobs, trees, and commits, and that branches are lightweight pointers to commits.

The part I don't quite grasp is how Git reflects these changes onto the file system across branch/commit checkouts.

For example:

Consider two files, A.txt and B.txt, committed in Commit 1. Commit 2 then adds a third file, C.txt, alongside them.

From what I understand, the object graph would be along the lines of the following:

Commit 1 points to Tree 1 which has blobs for the two initial files - BlobA and BlobB
Commit 2 points to Tree 2 which has blobs for three files. BlobA and BlobB remain the same since their content has not changed, while BlobC will also be under Tree 2.

Now, if I'm currently at Commit 2 and check out Commit 1, HEAD points to Commit 1, and we can traverse the directed graph that describes the state of the repository. The file C.txt is then no longer on the file system.

How does Git reflect the state of the object graph onto the file system on every checkout?

Thanks.

torek · Accepted Answer

Most of Git's work-tree actions are actually controlled via the index. This means no graph traversal is required at all!

The index's primary role (outside of merges at least) is to act as the place in which you build up the next commit to make. This gives it the name that many people prefer to use, the staging area. In the index, the version of a file such as README.txt will start out matching the HEAD version of that same file. Both versions are actually stored as blob objects in the repository, and since they match, they refer to the same blob.

The work-tree will contain a usable version of README.txt: the expanded version of the file, smudge-filtered and CRLF-adjusted if you have established such filtering. If you change the work-tree version and wish to commit the changes, you must run git add README.txt. This copies the work-tree file back into the index, applying any clean filter and doing the CRLF-to-LF adjustment if you have those enabled. It creates a new blob in the repository (or re-uses an existing blob if the new file content matches some existing content) and stores the new hash in the index. In effect, this replaces the index copy of the file.
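To make the "re-using an existing blob" part concrete, here is a minimal Python sketch of how a blob ID is derived from content and why identical content lands on the same blob. It assumes a SHA-1 repository and ignores clean filters and CRLF conversion; object_store, index, and add are hypothetical in-memory stand-ins, not Git's real data structures.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to a blob holding this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical stand-ins for the object database and the index.
object_store = {}  # blob ID -> content
index = {}         # path -> blob ID

def add(path: str, content: bytes) -> str:
    """Toy version of `git add`: store the blob if new, then record its ID."""
    blob_id = git_blob_id(content)
    # Identical content hashes to the same ID, so an existing blob is reused.
    object_store.setdefault(blob_id, content)
    index[path] = blob_id
    return blob_id
```

Here git_blob_id(b"") gives e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known ID of the empty blob, and adding two paths with the same content leaves only one entry in object_store.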

So far so good—but what happens when you have some commit checked out, e.g., as the result of git checkout master, and you issue the command git checkout develop? Here, the index takes on its second role, which is to keep track of—i.e., index—the work-tree and to keep cache information about the work-tree. (This is also the source of its third name, the cache.)

Git already translated master into a commit hash when it extracted that commit, but it does so again now. It also translates develop into a commit hash, so it has two commit hashes: one for master (the current commit) and one for develop (the desired commit). After making sure that these are indeed commits,¹ Git translates them into tree hash IDs: the HEAD tree, and the desired or target tree. At this point, Git is using the so-called two-tree merge mode of the git read-tree command.
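The name-to-commit-to-tree translation can be pictured with a toy model. Everything below (refs, objects, the short IDs, resolve_tree) is made up for illustration; real Git keeps refs in files or packed-refs and objects in its object database.

```python
# Hypothetical in-memory model of refs and objects, using short fake IDs.
refs = {
    "refs/heads/master": "c2",
    "refs/heads/develop": "c1",
}
objects = {
    "c1": {"type": "commit", "tree": "t1"},
    "c2": {"type": "commit", "tree": "t2"},
    "t1": {"type": "tree", "entries": {"A.txt": "blobA", "B.txt": "blobB"}},
    "t2": {"type": "tree",
           "entries": {"A.txt": "blobA", "B.txt": "blobB", "C.txt": "blobC"}},
}

def resolve_tree(branch: str) -> str:
    """Translate a branch name to a commit ID, then to that commit's tree ID."""
    commit_id = refs["refs/heads/" + branch]
    obj = objects[commit_id]
    if obj["type"] != "commit":
        # The sanity check mentioned in footnote 1.
        raise ValueError(branch + " does not name a commit")
    return obj["tree"]

head_tree = resolve_tree("master")     # tree of the current commit
target_tree = resolve_tree("develop")  # tree of the desired commit
```

With these two tree IDs in hand, nothing further up the commit graph needs to be consulted for the checkout.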

Meanwhile the index lists the hash ID for each tracked file in the work-tree. In the ideal case, for each file F in the index and/or in the HEAD commit, F will have the same hash in both HEAD and index. If so, the index copy of F is itself "clean" (matches). The work-tree copy may or may not be clean (may or may not match the index copy)—the index's role as cache here helps make this last test very fast in most cases.
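The "cache" role can be sketched roughly as follows: the index remembers stat data for each work-tree file, so Git can usually tell that a file is unchanged without reading its content. This is a simplified illustration only. CachedStat, snapshot, and probably_clean are hypothetical names, and real Git records more fields and has extra handling for files modified within the same timestamp tick.

```python
import os
from dataclasses import dataclass

@dataclass
class CachedStat:
    """Stat data remembered for a file at the time it was last indexed."""
    size: int
    mtime_ns: int

def snapshot(path: str) -> CachedStat:
    """Record the current stat data for a work-tree file."""
    st = os.stat(path)
    return CachedStat(size=st.st_size, mtime_ns=st.st_mtime_ns)

def probably_clean(path: str, cached: CachedStat) -> bool:
    """Cheap cleanliness test: compare stat data only, never file content."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False  # a missing file certainly does not match the index
    return st.st_size == cached.size and st.st_mtime_ns == cached.mtime_ns
```

Only when the stat data differs does the file need to be re-read and re-hashed to find out whether its content really changed.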

For each file F that exists in both HEAD and the target tree, either the target hash for F matches the HEAD hash for F, or it doesn't. For files that do not exist in the target tree, but do exist in HEAD, either the index and work-tree copy of F are clean, or they're not. If the file is clean, it's safe to remove both copies (if the file is not in the target) or replace both with the version from the target tree (if the file is in the target). But if the target tree hash matches the HEAD hash, there is no need to touch the index and work-tree entry at all, so Git doesn't.

In short, it's only where the HEAD and target trees don't match that Git needs to change anything in the work-tree to achieve the checkout. If the mismatch is that file xyz.txt is in HEAD but not in the target, the goal becomes to remove xyz.txt, but this is only allowed if the file is "clean" (unless, of course, you add --force to your git checkout). If the mismatch is that the file is in neither HEAD nor the index but is in the target, the goal becomes to create xyz.txt with the target's content, but this is only allowed if the file does not exist in the work-tree, or if the file is listed in an ignore directive.²
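The rules above can be sketched as a small planning function. This is a simplified model and not Git's actual code: trees and the index are plain dicts from path to blob ID, and the dirty, untracked, and ignored sets are assumed to be supplied by the caller. The real two-tree merge in git read-tree handles more cases, for example an index entry that already matches the target.

```python
from typing import Dict, List, Set, Tuple

Tree = Dict[str, str]  # path -> blob ID

def plan_checkout(head: Tree, index: Tree, target: Tree,
                  dirty: Set[str], untracked: Set[str], ignored: Set[str],
                  force: bool = False
                  ) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (actions, conflicts) for switching from head to target.

    dirty:     tracked paths whose work-tree copy differs from the index
    untracked: paths present in the work-tree but absent from the index
    ignored:   untracked paths matched by an ignore directive
    """
    actions: List[Tuple[str, str]] = []
    conflicts: List[str] = []
    for path in sorted(set(head) | set(index) | set(target)):
        h, i, t = head.get(path), index.get(path), target.get(path)
        if h == t:
            continue  # HEAD and target agree: leave index and work-tree alone
        if h is None and i is None:
            # In neither HEAD nor the index, but in the target: create it,
            # unless that would clobber a non-ignored untracked file.
            if path in untracked and path not in ignored and not force:
                conflicts.append(path)
            else:
                actions.append(("create", path))
            continue
        clean = (i == h) and path not in dirty
        if clean or force:
            actions.append(("remove" if t is None else "update", path))
        else:
            conflicts.append(path)
    return actions, conflicts
```

With the question's example, where HEAD and the index hold A.txt, B.txt and C.txt, the target holds only A.txt and B.txt, and nothing is dirty, the plan is the single action ("remove", "C.txt"). A.txt and B.txt are never touched because their hashes match in both trees.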

¹Branch names are required to identify commit hashes at all times. (Tag names are permitted to identify other types of objects.) So in theory there is no need to check this. Whether Git really does check depends on the code path.

²This last part is the source of some serious pain at times. Git really should—but does not—distinguish between "ignore this file because it's easy to re-create" and "ignore this file because it should not be committed, but never clobber it because it contains something precious like user configuration data."
