TLOlczyk
TLOlczyk

Reputation: 443

how are commit DAGS stored in GIT

How are DAGs stored internally in Git? As an example, consider the DAG A->B->C->D ->E->F->G.

You somehow to save the following information. A->B, B->C,C->D,A->E,E->F,F->G. So how is it stored? Given a particular node how can you tell what branch it is on?

Upvotes: 0

Views: 879

Answers (2)

ElpieKay
ElpieKay

Reputation: 30958

Each commit has N parents, with N an integer that is equal to or bigger than 0. A root commit has 0 parents. A non-merge commit has 1 parent. A merge commit has 2 or more parents. A parent is another commit. Each commit except the root stores the sha1(s) of its parent(s). Thus the commits are organized as a DAG. Knowing a commit, we can tell all of its ancestors. git log -1 <commit> --pretty=%P can output the commit's parent(s).

In one git repo, there may be one or more DAGs. A branch is a ref that points to, or an anchor that attaches to a commit (or another branch) . It's a bit like the pointer in C, with the commit as a struct or class variable and sha1 as its address. The branch can be moved from one commit to another. Sometimes it moves automatically, via introducing a new commit including git commit, git rebase, git merge, git cherry-pick, etc. Sometimes it moves by our will via git reset.

When we say a commit is on a branch, it can be understood in another way, that this commit is equal to, or the ancestor of that commit that is referred to by the branch. git branch --contains <commit> --all can tell which branches the commit is on.

Upvotes: 2

torek
torek

Reputation: 489828

Just so that we don't get lost in terminology: A DAG in general is a graph G = (V, E), composed of a vertex set V and edge set E, where each edge in the edge-set imposes a direction (also called an arc), and there are no cycles (no path from any vertex back to itself through the various arcs). A typical representation of edges is as pairs of vertices, e.g., if the nodes are represented as single uppercase letters (as in your example) we might have <A, B> in the edge-set E to indicate that vertex A connects to vertex B, in that direction. That is, <A, B> means there is an arc from A to B.

Git does not use this typical representation. Instead, each "vertex" is a commit, whose unique identifier is its hash ID (and I, at least, tend to call these "nodes" in the graph instead of "vertices"). Each commit lists its parent commit(s), by their hash IDs. Hence, if commit A (really a234567890123456789012345678901234567890 or some such) is the parent of commit B (really b876543210...), there's nothing in A naming B, but there is a parent ID in B naming A.

In other words, the edges in a Git graph are all backwards.

Meanwhile, branch names point to a single commit node, which is designated the tip commit of that branch. For instance master may resolve to 08bb3500a2a718c3c78b0547c68601cafa7a8fd9.

The name HEAD either contains the current branch name, or contains the raw hash ID of the current commit. Using git rev-parse, we can turn any name (including HEAD) into the appropriate ID:

$ git rev-parse HEAD
08bb3500a2a718c3c78b0547c68601cafa7a8fd9

We can now answer these questions:

So how is it stored?

A commit node exists in the repository as an object of type commit whose contents are (after the usual compression is expanded) just plain text in the format shown by git cat-file -p:

$ git cat-file -p HEAD | sed 's/@/ /'
tree a775288b86ae652ea163357939d852cdd927eed6
parent 36cafe44443fcca9eb35399ef0e9bfe289ec5dde
author Junio C Hamano <gitster pobox.com> 1468959976 -0700
committer Junio C Hamano <gitster pobox.com> 1468959976 -0700

Sixth batch of topics for 2.10

Signed-off-by: Junio C Hamano <gitster pobox.com>

This tells us that there is an arc from 08bb3500a2a718c3c78b0547c68601cafa7a8fd9 to 36cafe44443fcca9eb35399ef0e9bfe289ec5dde.

To find the complete graph—the set of all edges and vertices/nodes—we begin with all suitable starting points (see below) and read these objects from the repository. For commit objects, we read their parent lines, which provide additional node IDs and also provide an arc: from the node we just read, to the node named in each parent line. (A merge commit has multiple parent lines, rather than one parent line listing multiple IDs, but of course that's pretty trivial. Note also that each annotated tag object points to one other object, usually a commit, so when we find an object of type tag we should read its object line, and repeat this process until we find a non-tag object. But we won't find any such objects if we start only from branch names; see below.)

(In a normal non-Git DAG, no arc is especially distinguished, but in Git, the first parent listed for each node is distinguished from any additional parents, and the order matters when using the suffix-^ syntax. In particular, when you make a merge commit, the ID that used to be HEAD will become the first parent of the new merge commit.)

Given a particular node how can you tell what branch it is on?

This question has a flaw: it assumes that a node is on a branch.

A node may, in fact, be on no branch, or on many branches.

Let's return now to the notion of all suitable starting points. What starting-points are there? If we had a typical graph representation, we would have the full set of vertices (or nodes) listed somewhere. In Git, we don't have this.1 Instead, we have references, which are mostly names starting with refs/. Branches and tags are forms of references, starting with refs/heads/ and refs/tags/ respectively. The Git command git for-each-ref lets you find all of these references.

There are a few special purposes references that do not start with refs: HEAD, MERGE_HEAD, CHERRY_PICK_HEAD, ORIG_HEAD, and so on. Some Git commands need to look here as well. For your particular case, however, we care only about branch names, and all of those do start with refs/—in fact, with refs/heads/—so we can run git for-each-ref refs/heads to list them all. (The for-each-ref command adds the extra / for us here, on the theory that it's like a directory-list operation.)

Hence, to find out if node X (for some X) is on one or more branches, and if so, which ones, we start from the node ID stored under each branch name. That identifies the tip commit of that branch. We then follow that commit's parents, and those nodes' parents, and so on until we run out of parents (using whatever search algorithm we like). If we encounter node X along the way, node X is on that branch.

Thus, node X is contained in each branch from which we can find X starting from that branch's tip commit. This is what git branch --contains shows.

(Tag names generally point either directly to commit nodes (a "lightweight tag") or to a tag object (an "annotated tag"). So if we allow all references, we must be prepared to handle tags. Branch names are constrained to point only to commit nodes.)


1Well, we can do an exhaustive traversal of the entire repository to look for all objects. This is what git fsck does, for instance, or git gc.

Upvotes: 4

Related Questions