Reputation: 52273
I am learning about Git, and it would be great if I had a description of the mathematical structure that represents a Git repo. For instance: it's a directed acyclic graph; its nodes represent commits; its nodes have labels (at most one label per node, no label used twice) that represent branches, etc. (I know this description is not correct, I'm just trying to explain what I'm looking for.)
Upvotes: 11
Views: 2273
Reputation: 488463
In addition to the links in Nevik Rehnel's comment (copied here per request: eagain.net/articles/git-for-computer-scientists and gitolite.com/gcs), and sehe's point that the commit graph forms a Merkle Tree, I'll add a few notes.
- If a tree entry's mode is 120000 (the file mode for a symlink), the file's "contents" are really the symlink target. Mode 160000 is (ab)used for submodules. R and W mode bits are not stored, only X bits (and even then they're ignored if the repo configuration says to ignore them).
- There is a special "empty tree" object. When you make a commit that contains no files (for example with git commit --allow-empty in a new repo), it uses that empty tree. (Since the empty tree has no sub-objects, its SHA-1 hash value is a constant.)
- Unreferenced objects are eventually removed by git gc. The empty tree appears to be immune to collection. Anything in the refs/ and logs/ directories and the file packed-refs (in .git, or for bare repos or when $GIT_DIR is set, wherever else) acts as a reference, as do the special names (HEAD, ORIG_HEAD, etc.); I'm not sure whether other random files, if created in .git and containing valid SHA-1s, would act as references.
- When you git add a file, git drops the file into the object store and places the (non-text) SHA-1 hash into the index file. These are valid references that prevent garbage collection.
Upvotes: 10
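To make the notes about blobs, symlinks, and the constant empty-tree hash concrete, here is a minimal Python sketch of how SHA-1 object IDs can be computed outside of Git. It assumes the loose-object layout (a "type size" header, a NUL byte, then the payload) and the binary tree-entry layout; the helper names and the example paths are made up for illustration, and the tree helper only handles entries that are already in sorted order.

```python
import hashlib


def git_object_id(obj_type, payload):
    # Git hashes "<type> <size>", a NUL byte, then the raw payload.
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def tree_entry(mode, name, oid_hex):
    # One tree entry: "<mode> <name>", a NUL byte, then the 20-byte binary ID.
    return mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid_hex)


# A blob holds only file contents; the name and mode live in the tree.
empty_blob_id = git_object_id("blob", b"")

# For a symlink (mode 120000) the blob's "contents" are the link target.
# "docs/README" is a hypothetical target path.
symlink_blob_id = git_object_id("blob", b"docs/README")

# The empty tree has no entries, so its payload is empty and its ID never changes.
empty_tree_id = git_object_id("tree", b"")

# A small tree with a regular file and a symlink (entries given in name order).
tree_id = git_object_id(
    "tree",
    tree_entry("100644", "empty.txt", empty_blob_id)
    + tree_entry("120000", "link", symlink_blob_id),
)

print(empty_tree_id)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because the ID depends only on the type and payload, two files with identical contents share one blob, and any repo that records an empty tree records it under the same ID.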
Reputation: 393134
I think the most relevant answer would need to include the most important characteristic of Git revision trees: cryptographic hashing (each revision's ID is a hash that covers the hashes of its parent revisions along with the commit details).
This is known as a Merkle Tree: http://en.wikipedia.org/wiki/Merkle_tree
See an earlier answer for some background: (Git: How to treat commit so that versions of a file exist in their entirety (not just as diffs))
Background
Storing deltas was popularized by RCS, CVS, Subversion and others (SourceSafe?), mainly because the model made it easy to transfer changesets: they were already in delta form. Modern VCSes (mostly distributed) have moved away from that and put the emphasis on data integrity.
Data Integrity
Because of the design of the object database, git is very robust and will detect any corrupted bit of data anywhere in a snapshot, or the entire repo. See this post for more details on the cryptographic properties of Git repositories: Linus talk - Git vs. data corruption?
In techno babble: commit histories form cryptographically strong Merkle trees. When the SHA-1 sum of the tip commit (HEAD) matches, it follows (barring a hash collision) that
- the tree content
- the branch history (including all sign-offs and committer/author credentials)
are identical. This is a huge security feature of Git (and of other SCMs that share this design).
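As a rough illustration of why a matching tip hash implies a matching history, here is a small Python sketch of a hash chain. It is a simplification, not Git's exact commit format: real commits also carry author and committer lines with timestamps, and the function names and commit messages here are invented for the example.

```python
import hashlib

# Well-known ID of Git's empty tree, used here as a stand-in snapshot.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_id(tree_id, parent_ids, message):
    # Simplified commit body: tree line, parent lines, blank line, message.
    lines = [f"tree {tree_id}"] + [f"parent {p}" for p in parent_ids]
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tip_of(history):
    # Hash a linear history of (tree_id, message) pairs, oldest first.
    parents = []
    tip = None
    for tree_id, message in history:
        tip = commit_id(tree_id, parents, message)
        parents = [tip]  # the next commit embeds this hash
    return tip


original = [(EMPTY_TREE, "first"), (EMPTY_TREE, "second"), (EMPTY_TREE, "third")]
tampered = [(EMPTY_TREE, "first (edited)")] + original[1:]

same_tip = tip_of(original) == tip_of(list(original))
changed_tip = tip_of(original) != tip_of(tampered)
print(same_tip, changed_tip)  # True True
```

Editing only the oldest commit changes every later hash, because each commit's ID is computed over its parent's ID. That is the property that lets two parties compare a single tip hash instead of the whole history.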
Upvotes: 6