user8387663

Reputation: 3

Trying to understand git tracking, staging, etc

I've seen many threads on this but the answers tend to be inconsistent and contradictory, so if possible I'm after a short-as-possible explanation in no uncertain terms:

To my current understanding, we have a working directory of files that we are making our project with, a staging area, and a repository. The staging area is technically also an index and uses a cache, just a big list somewhere of all the files from the working directory that are slated for a commit, which copies the incremental changes over to the repository.

So for instance I have a file, test.txt, I write "1234" inside it and add it to the staging area then commit it to the repo, so I assume the entirety of test.txt is saved in some repo file somewhere. Then I edit the file and change the text to "1235" and commit that change. I assume the repo now doesn't save another copy of "1235" but just notes something like "the fourth character changed from '4' to '5'" or something.

Anyways before I started this, test.txt was untracked. Then when I did git add test.txt it now becomes a tracked file, a new file, and a staged file that is on the index.

Then if I do git commit -m "some message" the file becomes... what? Committed? Tracked but not staged?

And then if I edit the file again it becomes tracked and modified, but not necessarily staged unless I add it again, etc?

I am trying to understand what is defined as what where. Is my understanding right so far? Where do I need correction?

Upvotes: 0

Views: 188

Answers (1)

torek

Reputation: 488123

Before I dive into the rest of this, the definition for tracked file is trivial: a file is tracked if and only if it is currently in the index.

That's all you need to test: "Is path P in the index?" If so, P is tracked. If not, P is untracked (and hence either just untracked, or perhaps untracked and also ignored).
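One quick way to run that test from the command line is `git ls-files --error-unmatch`, which exits nonzero when a path is not in the index. A minimal sketch in a throwaway repository (the file names here are just examples):

```shell
# Throwaway repo; file names are examples only.
cd "$(mktemp -d)"
git init -q
echo data > tracked.txt
git add tracked.txt         # tracked.txt is now in the index
echo data > untracked.txt   # untracked.txt is not

# --error-unmatch makes ls-files fail (exit nonzero) for paths not in the index.
git ls-files --error-unmatch tracked.txt && echo "tracked"
git ls-files --error-unmatch untracked.txt 2>/dev/null || echo "untracked"
```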

The tricky part is figuring out what's in the index!

Direct questions and answers

To my current understanding, we have a working directory of files that we are making our project with, a staging area, and a repository. The staging area is technically also an index ...

Yes. The repository itself is a kind of database; inside this database we have four types of objects. The most interesting at this point are commit objects,1 which act as snapshots, made by saving—as a permanent copy, with some additional information—whatever is in the index at the time you run git commit. The second most interesting is the blob object. Blobs have multiple uses, but the main one is to store file content, which we'll see in a moment.

Crucially, there is always a current commit. You may name it in any number of ways, but the one way that always works is the word HEAD. The symbol @ also usually works (it only fails in truly ancient versions of Git). Which commit is "current", though, changes over time.
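You can check that both names resolve to the same hash with `git rev-parse`. A sketch in a fresh throwaway repository (the identity settings are placeholders for the demo):

```shell
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com   # placeholder identity for the demo
git config user.name "Example"
echo 1234 > test.txt
git add test.txt
git commit -qm initial

git rev-parse HEAD   # hash of the current commit
git rev-parse @      # same hash
```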

... and uses a cache,

Cache is just a third term for the index / staging-area.

just a big list somewhere of all the files from the working directory that are slated for a commit, which copies the incremental changes over to the repository.

This is not quite right.

The format of stuff in the index / staging-area / cache is usually not interesting (and is poorly documented and subject to change as well), but as already noted, each commit acts as a complete snapshot, not a change-set.

So for instance I have a file, test.txt, I write "1234" inside it and add it to the staging area then commit it to the repo, so I assume the entirety of test.txt is saved in some repo file somewhere.

It—the 1234\n data, that is—is saved in an object, specifically one of the blob objects mentioned above. That does not necessarily mean file. Repository objects have an internal format, about which relatively few promises are made. If you want to delve into these details, you should know that they may be stored loose (one per file in separate files), or packed (many in one pack file, with a pack index that is unrelated to the index/staging-area/cache index).

You are promised that you can extract any object, by ID, back to its original form, using either git cat-file --batch (this produces raw data and is a bit tricky to use) or git cat-file -p (this produces a pretty-printed variant and is easy to use). For tag and commit objects, the original data are generally printable already, so git cat-file -p <object-id> prints it as is. Tree objects are mixed, so git cat-file -p turns known binary contents into text. Blob objects are saved exactly as is.2
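As a quick sanity check of that promise, a sketch in a throwaway repository: store `hello world` plus a newline, then ask for its blob by hash ID and get the exact original bytes back.

```shell
cd "$(mktemp -d)"
git init -q
echo hello world > greeting.txt
git add greeting.txt   # writes the blob into the object database

# The blob's type and its exact original content come back out:
git cat-file -t 3b18e512dba79e4c8300dd08aeb37f8e728b8dad   # blob
git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad   # hello world
```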

Then I edit the file and change the text to "1235" and commit that change. I assume the repo now doesn't save another copy of "1235" but just notes something like "the fourth character changed from '4' to '5'" or something.

No, this is quite wrong. The new commit has a new, complete copy of the new content. Moreover, each blob is a complete snapshot of that particular version of that particular file-content, regardless of the file's name. (If both of these blobs are loose objects, and both are very large files—say, 4.5 GB of DVD data each, and not very compressible—then you've just used 9 GB of disk space for these two loose objects. See the object compression and packing section below for when this is or is not a problem.)

If you store the same content in two separate commits, though, whether under the same name or a different name, you store the blob just once. This holds even if you store one file's content twice in a single commit. For instance, the hash ID of any file consisting of just the text hello world (and a newline) is 3b18e512dba79e4c8300dd08aeb37f8e728b8dad:

$ echo hello world | git hash-object -t blob --stdin
3b18e512dba79e4c8300dd08aeb37f8e728b8dad

If you make six files containing just the one line hello world, your tree object (see footnote 1) will have six names associated with this hash ID:

$ for i in 1 2 3 4 5 6; do
>     echo hello world > copy_$i.txt; git add copy_$i.txt
> done
$ git commit -m 'six is just one'
[master (root-commit) 5a66ef1] six is just one
 6 files changed, 6 insertions(+)
 create mode 100644 copy_1.txt
 create mode 100644 copy_2.txt
 create mode 100644 copy_3.txt
 create mode 100644 copy_4.txt
 create mode 100644 copy_5.txt
 create mode 100644 copy_6.txt
$ git cat-file -p HEAD^{tree}
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    copy_1.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    copy_2.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    copy_3.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    copy_4.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    copy_5.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    copy_6.txt

and there we are: six files, one object.

Aside from that, though, let's say your test.txt file has 1234 plus a newline. Its hash ID is then:

$ echo 1234 | git hash-object -t blob --stdin
81c545efebe5f57d4cab2ba9ec294c4b0cadf672

Every file in the universe—more precisely, the universe of all files that will ever be in your repository3—that has the content 1234\n has this hash. It doesn't matter if it's named test.txt. It doesn't matter who made it, or when. It doesn't matter if it's stored on a local drive or in the cloud.4 All that matters is that 1234\n hashes to the above number.

Meanwhile, 1235\n hashes to a different number:

$ echo 1235 | git hash-object -t blob --stdin
d729899c33fcf5c75fda5369a64898c85a46bcf7

So your 1235\n contents go into this other blob.

If the blob is already in the database, nothing interesting happens: you just re-use it. If it is not in the database, Git adds it to the database. The blob must have a unique hash ID, different from every other object in the database. It always does, though; see footnote 3 again.
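You can watch this re-use happen with `git count-objects`. A sketch (file names are examples): `git add` writes only blobs, so adding two files with identical content leaves exactly one loose object in the database.

```shell
cd "$(mktemp -d)"
git init -q
echo 1234 > a.txt
git add a.txt
git count-objects -v | grep '^count:'   # count: 1  (one blob)
cp a.txt b.txt
git add b.txt                           # same content => same blob, re-used
git count-objects -v | grep '^count:'   # still count: 1
```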


1For completeness, the four object types are tag or annotated tag; commit; tree; and blob. A commit is typically quite short: try git cat-file -p HEAD to see your current commit. Note that it refers, by hash ID, to exactly one tree. This tree in turn refers to each of your files, giving their names and their blob hash IDs. If you have sub-directories, they are stored as sub-trees.
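To see the commit -> tree -> blob chain concretely, a sketch in a throwaway repository (placeholder identity settings):

```shell
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com   # placeholder identity
git config user.name "Example"
echo 1234 > test.txt
git add test.txt
git commit -qm initial

git cat-file -p HEAD            # the commit: its tree hash, author, message
git cat-file -p 'HEAD^{tree}'   # the tree: mode, type, blob hash, file name
```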

2You can enable particular conversions, such as end of line transformations, using .gitattributes and other tricks. These conversions happen during index-vs-work-tree operations; the in-repo representation always exactly matches the in-index representation, for technical reasons (in particular, the index stores only the object hash ID, so the object must be in "repo format" by this point).

3This is how the breaking of SHA-1 can be turned into a problem for Git: simply find two different files whose blob-hash is the same, and you can no longer store both of those files in the same repository. Since Git's blob-hash is not quite a straight SHA-1 hash of the two files, the example file-pair that the researchers supplied is not itself a problem for Git. Some other file-pair would be.

The chance of any two object IDs colliding at random is (currently, with SHA-1) 1 in 2^160, which is so tiny that we can ignore it. However, due to the birthday paradox, it's wise to keep the number of objects in any Git repository under a few quadrillion. If the average object took only one byte, you'd run into problems after a few thousand terabytes, i.e., a few petabytes. Obviously the average object size is larger, so figure many exabytes of repository size before problems become likely—except, of course, for engineered cracking of Git.

4Git doesn't literally store things in cloud storage, as it "likes" local file systems, although there's no reason you could not use a cloud-backed local file system. In fact, Dropbox and the like try to do just that—but they also insist on resolving actions taken on different computers using files of the same name, which interferes rather badly with Git's operation, since Git needs to maintain its own metadata about its internal files. If you use Git-LFS, that has its own trick for offloading large files into separate storage areas, using "clean" and "smudge" filters and some clever .gitattributes file hacking.


Understanding (and viewing) the index

If you want to view the index directly, Git has what it calls a plumbing command—i.e., a command not meant for humans to use—to do this: git ls-files. Try it out—especially with the --stage argument—but note that in a big repository, it produces far too much output.
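In a tiny repository, though, the output is manageable. Each `--stage` line shows the mode, the blob hash, a stage number (0 except during merge conflicts), and the path. A sketch, re-using the 1234 example:

```shell
cd "$(mktemp -d)"
git init -q
echo 1234 > test.txt
git add test.txt
git ls-files --stage
# 100644 81c545efebe5f57d4cab2ba9ec294c4b0cadf672 0	test.txt
```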

It's usually best to view the index indirectly. Remember, there is always a current commit (HEAD or @). There is always a current work-tree that you can see, too. Suppose you have these two files:

   HEAD       index     work-tree
---------   ---------   ---------
README.md   README.md   README.md
test.txt    test.txt    test.txt

You can run two git diffs, one to compare HEAD vs index, and one to compare index vs work-tree. This is what git status does.

If all three versions of both files are exactly the same, and you run git status, it says nothing at all.

If you change the work-tree version of test.txt and run git status, it finds no difference in HEAD vs index, but it finds index vs work-tree are different. So it tells you that you have unstaged changes to test.txt.

If you copy the new work-tree version into the index and run git status again, now the index and work-tree match, but the first diff—HEAD vs index—no longer matches. So now git status says that you have staged changes.

If you add a new file z to the work-tree, the picture looks like this:

   HEAD       index     work-tree
---------   ---------   ---------
README.md   README.md   README.md
test.txt    test.txt    test.txt
                        z

Now git status says that you have an untracked file, because z is not in the index. Use git add z to copy it to the index and you get this picture:

   HEAD       index     work-tree
---------   ---------   ---------
README.md   README.md   README.md
test.txt    test.txt    test.txt
            z           z

Run git status again and it does two git diffs again. Now z is in the index (is tracked) but is not in HEAD, so it is a new file and is staged. It's the same in the index and the work-tree so git status doesn't mention it a second time.5


5I actually like the output from git status --short, which shows you both diffs at once for each file. Untracked files get two question marks, while tracked files get one or two letters, placed into the first two columns, to describe HEAD-vs-index and index-vs-work-tree.
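The whole cycle above can be replayed in a few commands. A sketch (placeholder identity settings): an M in the second column means index-vs-work-tree differ, an M in the first column means HEAD-vs-index differ.

```shell
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com   # placeholder identity
git config user.name "Example"
echo 1234 > test.txt
git add test.txt
git commit -qm initial

echo 1235 > test.txt
git status --short   # " M test.txt": unstaged change (index vs work-tree)
git add test.txt
git status --short   # "M  test.txt": staged change (HEAD vs index)
```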

Object compression and packing

notes something like "the fourth character changed from '4' to '5'" ...

Git does do this sort of thing, but—in a departure from most version control systems—it does this kind of delta compression at a lower level than that of blobs.

Loose objects are simply compressed with zlib. You can find these objects in .git/objects/, under their hash-ID names (with the first two characters split off to make a directory, so that directories do not get too large). Open one of these files, read it, and de-zlib-compress it, and you will see the loose object data format.
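For example, a sketch: the 1234\n blob from earlier lands at a path built from its hash ID.

```shell
cd "$(mktemp -d)"
git init -q
echo 1234 > test.txt
git add test.txt
# First two hex digits become the directory, the remaining 38 the file name:
ls .git/objects/81/
# c545efebe5f57d4cab2ba9ec294c4b0cadf672
```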

When there are enough loose objects to make it worthwhile, though, Git packs the loose objects. This packing uses a heuristic algorithm (because doing a perfect job takes far too much computation), but it essentially amounts to finding objects that "smell enough alike" to suggest that delta encoding one object based upon some other object will make for a smaller pack file.

If the object packer picks these two files, it will notice that one is 1234\n and the other is 1235\n and that it can represent the second object as "replace the 4th character of the previous object". There's no particular promise that 1235\n will be based on 1234\n—it could go in the other order—but as a rule Git tries to keep the "most recent" objects in the "least compressed" form on the theory that newer history is accessed more often than older history.

Note that one object can be based on a previous object that itself is based on yet-previous objects. This is called a chain or delta chain: we must expand each object down to the base in order to apply each delta in turn in order to come up with the last object in the chain. Git's deltifier will limit delta chain lengths; see the --depth argument to the various things that invoke it.
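You can inspect the result with `git verify-pack -v`. A sketch (whether any particular object actually gets deltified is up to the packer's heuristics, so treat the delta columns as a maybe): make two commits with similar large-ish content, pack, then list the pack.

```shell
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com   # placeholder identity
git config user.name "Example"
seq 1 1000 > big.txt; git add big.txt; git commit -qm one
seq 2 1000 > big.txt; git add big.txt; git commit -qm two
git gc --quiet       # pack the loose objects

# Each line: hash, type, size, packed size, offset; deltified objects
# additionally show their chain depth and base object.
git verify-pack -v .git/objects/pack/pack-*.idx
```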

(In this particular case, the object ID is far longer than the object's content, so there is no benefit at all to trying to make a delta here. The principle applies to bigger files, though. Note also that delta compression should always be applied before any binary compression: delta compression relies on smaller Shannon entropy values in the input files, while compressors like gzip and bzip2 work by squeezing out such entropy.)

The format of pack files has changed several times. See also Are Git's pack files deltas rather than snapshots?

Upvotes: 1
