max
max

Reputation: 52273

What is the mathematical structure that represents a Git repo

I am learning about Git, and it would be great if I had a description of the mathematical structure that represents a Git repo. For instance: it's a directed acyclic graph; its nodes represent commits; its nodes have labels (at most one label per node, no label used twice) that represent branches, etc. (I know this description is not correct, I'm just trying to explain what I'm looking for.)

Upvotes: 11

Views: 2273

Answers (2)

torek
torek

Reputation: 488463

In addition to the links in Nevik Rehnel's comment (copied here per request: eagain.net/articles/git-for-computer-scientists and gitolite.com/gcs), and sehe's point that the commit graph forms a Merkle Tree, I'll add a few notes.

  • There are four object types in the object-store: commit, tree, annotated-tag, and blob (file).
  • A commit object contains exactly one tree-ref (which of course can point to more trees), a possibly-empty list of parent SHA-1 hashes (which must all be more commits), an author (name, email, and timestamp), a committer (same form as author), and the commit text.
  • A tree object contains a list of (mode, sub-object, filename) repeated 0-or-more-times. If the sub-object is another tree the filename represents a directory. If it's a blob, it represents a file. The mode looks like a POSIX file mode and if it's 120000 (the file mode for a symlink), the file's "contents" are really the symlink target. Some mode value is (ab)used for submodules, but I forget which. R and W mode bits are not stored, only X bits (and even then they're ignored if the repo configuration says to ignore them).
  • An annotated-tag object contains an object reference, a tagger (name, email, and timestamp), and the tag text. The referenced object is normally a commit but a tag object can point to any object (even another tag object).
  • The labels (branches and tags and reflog-references and so on) live outside the object-store. For annotated tags, there's a label outside, pointing to the annotated tag object inside the object-store. For a lightweight tag, the outside label points right to a commit.
  • There is no restriction that there be only one root commit. Any commit with no parents is a root.
  • Git almost never makes an empty tree (which would represent an empty directory), except for two cases: there's an empty tree at all times in every repo, and if you make an initial empty commit (with git commit --allow-empty) it uses that empty tree. (Since the empty tree has no sub-objects, its SHA-1 hash value is a constant.)
  • The "DAG" description is generally meant for the trees formed by closing over commit parent hashes. However, a tree object should in general not contain itself in any of its subtrees, and if you managed to make a cyclic tree structure you would not be able to check it out (because it recurses infinitely). Assuming you cannot make two different trees with the same checksum (if you could you'd break git), you won't find a tree T1 that contains a tree T2 that contains a different tree whose checksum is T1. So the trees are implicitly a DAG too, and being attached to commit-DAGs, they form a bigger DAG. :-)
  • Unreferenced objects in the object-store will get garbage-collected by git gc. The empty tree appears to be immune to collection. Anything in the refs/ and logs/ directories and the file packed-refs (in .git, or for bare repos or when $GIT_DIR is set, wherever else) acts as a reference, as do the special names (HEAD, ORIG_HEAD, etc.); I'm not sure if other random files, if created in .git and containing valid SHA-1s, would act as references, or not.
  • The index has some format I've never dug into. It contains references to objects in the object store. When you git add a file, git drops the file into the object-store and places the (non-text) SHA-1 hash into the index file. These are valid references that prevent garbage collection.

Upvotes: 10

sehe
sehe

Reputation: 393134

I think the most relevant answer would need to include the most important characteristic of Git revision trees: cryptographic signature (each revision includes the hash of parent revision and commit details).

This is known as a Merkle Tree: http://en.wikipedia.org/wiki/Merkle_tree


See an earlier answer for some background: (Git: How to treat commit so that versions of a file exist in their entirety (not just as diffs))

Background

Storing deltas was popularized by RCS, CVS, Subversion and others (SourceSafe?). Mainly, because the model made it easy to transfer changesets because they would already be in delta form. Modern VCS-es (mostly distributed) have evolved away from that, and put the emphasis on data integrity.

Data Integrity

Because of the design of the object database, git is very robust and will detect any corrupted bit of data anywhere in a snapshot, or the entire repo. See this post for more details on the cryptographic properties of Git repositories: Linus talk - Git vs. data corruption?

In techno babble: commit histories form cryptographically strong merkle trees. When the sha1 sum of the tip commit (HEAD) matches, it mathematically follows that

  • tree content
  • the branch history (including all sign-offs and committer/author credentials)

are identical. This is a huge security feature of git (and other SCMs that share this design feature)

Upvotes: 6

Related Questions