I've seen many threads on this but the answers tend to be inconsistent and contradictory, so if possible I'm after a short-as-possible explanation in no uncertain terms:
To my current understanding, we have a working directory of files that we are making our project with, a staging area, and a repository. The staging area is technically also an index and uses a cache, just a big list somewhere of all the files from the working directory that are slated for a commit, which copies the incremental changes over to the repository.
So for instance I have a file, test.txt, I write "1234" inside it and add it to the staging area then commit it to the repo, so I assume the entirety of test.txt is saved in some repo file somewhere. Then I edit the file and change the text to "1235" and commit that change. I assume the repo now doesn't save another copy of "1235" but just notes something like "the fourth character changed from '4' to '5'" or something.
Anyways before I started this, test.txt was untracked. Then when I did git add test.txt it now becomes a tracked file, a new file, and a staged file that is on the index.
Then if I do git commit -m "some message" the file becomes... what? Committed? Tracked but not staged?
And then if I edit the file again it becomes tracked and modified, but not necessarily staged unless I add it again, etc?
I am trying to understand what is defined as what where. Is my understanding right so far? Where do I need correction?
Reputation: 488123
Before I dive into the rest of this, the definition of a tracked file is trivial: a file is tracked if and only if it is currently in the index.
That's all you need to test: "Is path P in the index?" If so, P is tracked. If not, P is untracked (and hence either just untracked, or perhaps untracked and also ignored).
The tricky part is figuring out what's in the index!
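As a minimal sketch of that test, the following asks Git whether a path is in the index before and after git add. The scratch directory, the file name, and the is_tracked helper are illustrative assumptions; the only Git feature relied on is git ls-files --error-unmatch, which fails when a path is not in the index.

```shell
# Scratch repository in a temporary directory (hypothetical, for illustration).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

echo 1234 > test.txt

# is_tracked PATH: succeeds only if PATH is currently in the index.
# (Helper name is made up for this example.)
is_tracked() {
    git ls-files --error-unmatch -- "$1" >/dev/null 2>&1
}

if is_tracked test.txt; then before=tracked; else before=untracked; fi
git add test.txt
if is_tracked test.txt; then after=tracked; else after=untracked; fi

echo "before git add: $before"
echo "after git add:  $after"
```

Nothing has been committed at any point here: being in the index is enough to make the file tracked.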
To my current understanding, we have a working directory of files that we are making our project with, a staging area, and a repository. The staging area is technically also an index ...
Yes. The repository itself is a kind of database; inside this database we have four types of objects. The most interesting at this point are commit objects,[1] which act as snapshots: git commit makes one by saving, as a permanent copy with some additional information, whatever is in the index at the time you run it. The second most interesting is the blob object. Blobs have multiple uses, but the main one is to store file content, which we'll see in a moment.
Crucially, there is always a current commit (except in a brand-new repository that has no commits yet). You may name it in any number of ways, but the one way that always works is the word HEAD. The symbol @ also usually works (it only fails in truly ancient versions of Git). Which commit is "current", though, changes over time.
... and uses a cache,
Cache is just a third term for the index / staging-area.
just a big list somewhere of all the files from the working directory that are slated for a commit, which copies the incremental changes over to the repository.
This is not quite right.
The format of stuff in the index / staging-area / cache is usually not interesting (and is poorly documented and subject to change as well), but as already noted, each commit acts as a complete snapshot, not a change-set.
So for instance I have a file, test.txt, I write "1234" inside it and add it to the staging area then commit it to the repo, so I assume the entirety of test.txt is saved in some repo file somewhere.
It (the 1234\n data, that is) is saved in an object, specifically one of the blob objects mentioned above. That does not necessarily mean a file. Repository objects have an internal format, about which relatively few promises are made. If you want to delve into these details, you should know that objects may be stored loose (one object per file) or packed (many objects in one pack file, with a pack index that is unrelated to the index / staging-area / cache).
You are promised that you can extract any object, by ID, back to its original form, using either git cat-file --batch (this produces raw data and is a bit tricky to use) or git cat-file -p (this produces a pretty-printed variant and is easy to use). For tag and commit objects, the original data are generally printable already, so git cat-file -p <object-id> prints them as is. Tree objects are mixed, so git cat-file -p turns known binary contents into text. Blob objects are saved exactly as is.[2]
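A short sketch of that round trip, assuming a throwaway repository: it writes the 1234 content into the object database with git hash-object -w and reads it back with git cat-file. The variable names are illustrative.

```shell
# Scratch repository (hypothetical, for illustration).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# -w writes the object into the repository database and prints its hash ID.
id=$(echo 1234 | git hash-object -w --stdin)

type=$(git cat-file -t "$id")      # object type
size=$(git cat-file -s "$id")      # content size in bytes
content=$(git cat-file -p "$id")   # the content itself

echo "$id is a $type of $size bytes containing: $content"
```

The size counts the trailing newline that echo adds, which is why the four visible characters are stored as five bytes.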
Then I edit the file and change the text to "1235" and commit that change. I assume the repo now doesn't save another copy of "1235" but just notes something like "the fourth character changed from '4' to '5'" or something.
No, this is quite wrong. The new commit has a new, complete copy of the new content. Moreover, each blob is a complete snapshot of that particular version of that particular file-content, regardless of the file's name. (If both of these blobs are loose objects, and both are very large files—say, 4.5 GB of DVD data each, and not very compressible—then you've just used 9 GB of disk space for these two loose objects. See the object compression and packing section below for when this is or is not a problem.)
If you store the same content in two separate commits, though, whether under the same name or a different name, you store the blob just once. This holds even if you store one file twice in a single commit. For instance, the hash ID of any file consisting of just the text hello world (and a newline) is 3b18e512dba79e4c8300dd08aeb37f8e728b8dad:
$ echo hello world | git hash-object -t blob --stdin
3b18e512dba79e4c8300dd08aeb37f8e728b8dad
If you make six files containing just the one line hello world, your tree object (see footnote 1) will have six names associated with this hash ID:
$ for i in 1 2 3 4 5 6; do
> echo hello world > copy_$i.txt; git add copy_$i.txt
> done
$ git commit -m 'six is just one'
[master (root-commit) 5a66ef1] six is just one
6 files changed, 6 insertions(+)
create mode 100644 copy_1.txt
create mode 100644 copy_2.txt
create mode 100644 copy_3.txt
create mode 100644 copy_4.txt
create mode 100644 copy_5.txt
create mode 100644 copy_6.txt
$ git cat-file -p HEAD^{tree}
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad copy_1.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad copy_2.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad copy_3.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad copy_4.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad copy_5.txt
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad copy_6.txt
and there we are: six files, one object.
Aside from that, though, let's say your test.txt file has 1234 plus a newline. Its hash ID is then:
$ echo 1234 | git hash-object -t blob --stdin
81c545efebe5f57d4cab2ba9ec294c4b0cadf672
Every file in the universe (more precisely, the universe of all files that will ever be in your repository[3]) that has the content 1234\n has this hash. It doesn't matter if it's named test.txt. It doesn't matter who made it, or when. It doesn't matter if it's stored on a local drive or in the cloud.[4] All that matters is that 1234\n hashes to the above number.
Meanwhile, 1235\n hashes to a different number:
$ echo 1235 | git hash-object -t blob --stdin
d729899c33fcf5c75fda5369a64898c85a46bcf7
So your 1235\n contents go into this other blob.
If the blob is already in the database, nothing interesting happens: you just re-use it. If it is not in the database, Git adds it to the database. The blob must have a unique hash ID, different from every other object in the database. It always does, though; see footnote 3 again.
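To make the "only the content matters" point concrete, here is a sketch that reproduces a blob hash by hand. It assumes a sha1sum command is available (on some systems the equivalent is shasum) and a repository using the default SHA-1 object format; the variable names are illustrative. The input is a short header, "blob", a space, the byte count and a NUL byte, followed by the raw content. No file name, author, or date is involved.

```shell
# Scratch repository so that git hash-object runs in a known (SHA-1) setup.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# The content, including its trailing newline.
content='1234
'
# Byte length of the content; tr strips padding that some wc versions add.
size=$(printf '%s' "$content" | wc -c | tr -d ' ')

# Hash "blob <size>", a NUL byte, then the raw content.
manual=$( { printf 'blob %s\0' "$size"; printf '%s' "$content"; } | sha1sum | cut -d' ' -f1 )

# Git's own answer for the same bytes.
viagit=$(printf '%s' "$content" | git hash-object --stdin)

echo "manual:  $manual"
echo "via git: $viagit"
```

The two printed values should be identical, and they stay the same no matter which path the content is later stored under.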
[1] For completeness, the four object types are tag (or annotated tag), commit, tree, and blob. A commit is typically quite short: try git cat-file -p HEAD to see your current commit. Note that it refers, by hash ID, to exactly one tree. This tree in turn refers to each of your files, giving their names and their blob hash IDs. If you have sub-directories, they are stored as sub-trees.
[2] You can enable particular conversions, such as end of line transformations, using .gitattributes and other tricks. These conversions happen during index-vs-work-tree operations; the in-repo representation always exactly matches the in-index representation, for technical reasons (in particular, the index stores only the object hash ID, so the object must be in "repo format" by this point).
[3] This is how the breaking of SHA-1 can be turned into a problem for Git: find two different files whose blob hash is the same, and you can no longer store both of those files in the same repository. Since Git's blob hash is not quite a straight SHA-1 hash of the file's content (Git hashes a short header along with the content), the example files that the researchers have supplied are not themselves a problem for Git. Some other file pair would be.
The chance of any two object IDs colliding at random is (currently, with SHA-1) 1 in 2^160, which is so tiny that we can ignore it. However, due to the birthday paradox, it's wise to keep the number of objects in any Git repository under a few quadrillion. If the average object took only one byte, you'd run into problems after a few thousand terabytes, i.e., a few petabytes. Obviously the average object size is larger, so figure many exabytes of repository size before problems become likely (except, of course, for engineered cracking of Git).
[4] Git doesn't literally store things in cloud storage, as it "likes" local file systems, although there's no reason you could not use a cloud-backed local file system. In fact, Dropbox and the like try to do just that, but they also insist on resolving actions taken on different computers using files of the same name, which interferes rather badly with Git's operation, since Git needs to maintain its own metadata about its internal files. If you use Git-LFS, that has its own trick for offloading large files into separate storage areas, using "clean" and "smudge" filters and some clever .gitattributes file hacking.
If you want to view the index directly, Git has what it calls a plumbing command (i.e., a command meant for scripts rather than for humans to use) to do this: git ls-files. Try it out, especially with the --stage argument, but note that in a big repository, it produces far too much output.
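For a feel of what git ls-files --stage prints, here is a small sketch in a throwaway repository (the directory and variable names are assumptions). Each output line holds the file mode, the blob hash ID, a stage number, and then a tab and the path.

```shell
# Scratch repository (hypothetical, for illustration).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

echo 1234 > test.txt
git add test.txt

# Show every index entry.
git ls-files --stage

# In a large repository, restrict the listing to the paths of interest.
entry=$(git ls-files --stage -- test.txt)
mode=$(echo "$entry" | cut -d' ' -f1)
staged_id=$(echo "$entry" | cut -d' ' -f2)

echo "mode: $mode"
echo "blob: $staged_id"
```

The blob hash in the index entry is the same one git hash-object reports for the file's content, which is the sense in which the index holds a ready-to-commit copy.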
It's usually best to view the index indirectly. Remember, there is always a current commit (HEAD or @). There is always a current work-tree that you can see, too. Suppose you have these two files:
HEAD        index       work-tree
---------   ---------   ---------
README.md   README.md   README.md
test.txt    test.txt    test.txt
You can run two git diffs, one to compare HEAD vs index (git diff --cached), and one to compare index vs work-tree (plain git diff). This is, in effect, what git status does.
If all three versions of both files are exactly the same, and you run git status, it reports that there is nothing to commit.
If you change the work-tree version of test.txt and run git status, it finds no difference in HEAD vs index, but it finds index vs work-tree are different. So it tells you that you have unstaged changes to test.txt.
If you copy the new work-tree version into the index (with git add test.txt) and run git status again, now the index and work-tree match, but the first diff, HEAD vs index, no longer matches. So now git status says that you have staged changes.
If you add a new file z to the work-tree, the picture looks like this:
HEAD        index       work-tree
---------   ---------   ---------
README.md   README.md   README.md
test.txt    test.txt    test.txt
                        z
Now git status says that you have an untracked file, because z is not in the index. Use git add z to copy it to the index and you get this picture:
HEAD        index       work-tree
---------   ---------   ---------
README.md   README.md   README.md
test.txt    test.txt    test.txt
            z           z
Run git status again and it does two git diffs again. Now z is in the index (is tracked) but is not in HEAD, so it is a new file and is staged. It's the same in the index and the work-tree so git status doesn't mention it a second time.[5]
[5] I actually like the output from git status --short, which shows you both diffs at once for each file. Untracked files get two question marks, while tracked files get one or two letters, placed into the first two columns, to describe HEAD-vs-index and index-vs-work-tree.
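The walk-through above can be replayed as a sketch in a throwaway repository. The directory, the commit identity passed with -c, and the variable names are all assumptions made for this example. git diff --cached compares HEAD with the index, and plain git diff compares the index with the work-tree.

```shell
# Scratch repository with one commit holding README.md and test.txt.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
echo readme > README.md
echo 1234 > test.txt
git add README.md test.txt
git -c user.name=Example -c user.email=example@example.com commit -q -m initial

# Change only the work-tree copy of test.txt.
echo 1235 > test.txt
staged_1=$(git diff --cached --name-status)     # HEAD vs index
unstaged_1=$(git diff --name-status)            # index vs work-tree

# Copy the new version into the index.
git add test.txt
staged_2=$(git diff --cached --name-status)
unstaged_2=$(git diff --name-status)

# A new file: first untracked, then staged as a new file.
echo new > z
short_untracked=$(git status --short -- z)
git add z
short_added=$(git status --short -- z)

echo "after edit: staged=[$staged_1] unstaged=[$unstaged_1]"
echo "after add:  staged=[$staged_2] unstaged=[$unstaged_2]"
echo "z before add: [$short_untracked]"
echo "z after add:  [$short_added]"
```

After the edit only the second diff is non-empty; after git add only the first one is. The two --short lines for z show the two-column format described in footnote 5.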
notes something like "the fourth character changed from '4' to '5'" ...
Git does do this sort of thing, but—in a departure from most version control systems—it does this kind of delta compression at a lower level than that of blobs.
Loose objects are simply compressed with zlib. You can find these objects in .git/objects/, under their hash-ID names (with the first two characters split off to make a directory, so that directories do not get too large). Open one of these files, read it, and de-zlib-compress it, and you will see the loose object data format.
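As a sketch of that naming scheme, the following writes one blob into a throwaway repository and derives the path of its loose-object file from the hash ID. The directory and variable names are illustrative, and the layout shown applies only while the object is still loose.

```shell
# Scratch repository (hypothetical, for illustration).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Write a blob; a freshly written object starts out loose.
id=$(echo 1234 | git hash-object -w --stdin)

# First two hex characters name the directory, the rest name the file.
dir=$(echo "$id" | cut -c1-2)
file=$(echo "$id" | cut -c3-)
path=".git/objects/$dir/$file"

ls -l "$path"
```

The file's bytes are zlib-compressed, so reading them directly shows binary data; git cat-file -p with the hash ID remains the convenient way to see the content.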
When there are enough loose objects to make it worthwhile, though, Git packs the loose objects. This packing uses a heuristic algorithm (because doing a perfect job takes far too much computation), but it essentially amounts to finding objects that "smell enough alike" to suggest that delta encoding one object based upon some other object will make for a smaller pack file.
If the object packer picks these two files, it will notice that one is 1234\n and the other is 1235\n and that it can represent the second object as "replace the 4th character of the previous object". There's no particular promise that 1235\n will be based on 1234\n (it could go in the other order), but as a rule Git tries to keep the "most recent" objects in the "least compressed" form on the theory that newer history is accessed more often than older history.
Note that one object can be based on a previous object that itself is based on yet-previous objects. This is called a chain or delta chain: we must expand each object down to the base in order to apply each delta in turn in order to come up with the last object in the chain. Git's deltifier will limit delta chain lengths; see the --depth argument to the various things that invoke it.
(In this particular case, the object ID is far longer than the object's content, so there is no benefit at all to trying to make a delta here. The principle applies to bigger files, though. Note also that delta compression should always be applied before any binary compression: delta compression relies on smaller Shannon entropy values in the input files, while compressors like gzip and bzip2 work by squeezing out such entropy.)
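To watch loose objects turn into packed ones, here is a sketch that makes two commits in a throwaway repository and compares git count-objects -v before and after git gc. The directory, commit identity, and variable names are assumptions. It does not try to show which objects end up delta-encoded, since with content this small Git has no reason to make a delta.

```shell
# Scratch repository with two commits of test.txt.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
commit() {
    # Hypothetical identity supplied on the command line for this example.
    git -c user.name=Example -c user.email=example@example.com commit -q -m "$1"
}
echo 1234 > test.txt; git add test.txt; commit one
echo 1235 > test.txt; git add test.txt; commit two

# "count" is the number of loose objects, "in-pack" the number of packed ones.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

# Pack everything reachable.
git gc -q

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before gc: $loose_before"
echo "loose after gc:  $loose_after"
echo "packed after gc: $packed_after"
```

Each commit here contributes a blob, a tree, and a commit object, and all of them move from loose storage into a single pack file.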
The format of pack files has changed several times. See also Are Git's pack files deltas rather than snapshots?