Reputation: 2666
I'm a graduate student experimenting with using Github for science. Often I might perform different tasks for the same project on different servers. I don't always want all of these functions on the same server at the same time, but it would be nice to keep them organized in one remote location.
Is it possible to `git push` two directories with different content to the same origin/GitHub repo? Basically, in this setup I would almost never pull content; just push it to keep myself organized and potentially share with collaborators. Or maybe use `sparse-checkout` to occasionally clone individual files/subdirectories.
Upvotes: 3
Views: 2033
Reputation: 488193
It is possible.
It's not particularly useful, though you can make it useful by careful discipline.
Git is fundamentally a two-part system: there's an object database, which is a simple key/value store; and there is a naming system, which is a second simple key/value store. The keys in the naming system are references, which are strings like `refs/heads/master`, `refs/heads/branch`, `refs/tags/v1.2`, and so on, and their values are hash IDs. The keys in the object database are hash IDs, and their values are Git objects.
It's the fact that the naming system uses arbitrary keys (well, arbitrary except for the leading `refs/` requirement) that allows you to do what you're suggesting. It's the way that these keys' values work, and the way that the object database key/value system works, that makes it not so useful.
If you imagine the name keys as a simple flat table, you have:
name value
------------------------------
refs/heads/master 1234567...
refs/heads/branch fedcba9...
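You can dump this name/value table directly with `git for-each-ref`. A minimal sketch in a throwaway repository (the identity `me@example.com` is a placeholder, and your default branch may be `main` rather than `master`):

```shell
set -e

# Build a throwaway repo with two branch names pointing at one commit.
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "root"
git branch branch

# Prints one "<hash> commit <name>" line per reference,
# i.e. the name/value table from the text.
git for-each-ref
```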
When you run `git push`, the final part of the `git push` operation is to deliver some name/value pairs to the server (on GitHub in this case) and ask them to set those name/value pairs in this table. They either accept the request to set those up, or refuse it (on a case-by-case basis, i.e., if you deliver two name/value pairs, they may accept one and refuse the other).
It's the hash-ID/object part of the database that is complicated. Each hash ID is unique to its particular object. There are four object types: commit, tree, tag, and blob. All share a common header encoding (the type name in ASCII, an ASCII blank, the object's size encoded as a decimal number with no leading zeros, and an all-zero-bits byte) followed by type-specific data. (The actual hash ID is just an SHA-1 hash of this header plus the type-specific data bytes.)
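This header-plus-data hashing can be reproduced with nothing but `printf` and `sha1sum` (assuming GNU coreutils). A sketch, using the six-byte content `hello\n`:

```shell
# Hash the loose-object header ("blob", space, decimal size, NUL byte)
# followed by the content, exactly as Git does.
hash=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)
echo "$hash"
```

The result matches what `git hash-object` reports for the same content.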
A blob object represents raw data, typically a file, so it just consists of raw bytes after the header. Two files with the same size and content have the same hash, i.e., are the same blob, so that if there is a file `README` and then a second file `README~` that has the same content, there's only one repository blob for both files.
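A quick sketch of that deduplication, using `git hash-object` in a throwaway repository (the file names just mirror the example above):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q

# Two files, identical content.
printf 'same content\n' > README
printf 'same content\n' > README~

# Both files hash to the same blob ID, so the repository
# would store the data only once.
a=$(git hash-object README)
b=$(git hash-object README~)
[ "$a" = "$b" ] && echo "one blob for both files"
```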
For Git to store files by name, it needs to store file names as well, so it does this with tree objects. A tree object has a special, well-known internal format, which although it is binary, is essentially what you see when you run `git ls-tree`: for each file in a tree or sub-tree, Git stores a mode, a Git object hash ID, and a file name. (The object type is found by reading the object.)
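A minimal sketch of that mode / hash / name layout, in a throwaway repository (file name and identity are made up for illustration; `git ls-tree` also prints the object type between the mode and the hash):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q

printf 'hello\n' > greeting.txt
git add greeting.txt
git -c user.name=me -c user.email=me@example.com \
    commit -q -m "add greeting"

# One line per entry: <mode> <type> <hash>	<name>
git ls-tree HEAD
```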
For Git to store a snapshot in a commit, it needs to store a tree object that goes with that commit. So a commit object has, as part of its object data, that tree's hash ID. Each commit also records its parent hash ID (or IDs, for a merge), so that the commit objects form a Directed Acyclic Graph: the parent hash IDs act as the outbound edges E of the vertex V represented by the commit, and G = (V, E) is the commit DAG. Within this commit DAG, each commit points to one tree, which points to sub-trees and/or blobs, so Git can walk the commit DAG to find commits, use each commit's tree to obtain file names and blob IDs, and extract the files for that commit.
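You can see the tree and parent hash IDs stored inside a commit with `git cat-file -p`. A sketch in a throwaway repository (two empty commits, so the second one has a parent):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "root"
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "child"

# Prints "tree <hash>", "parent <hash>", author, committer, message.
git cat-file -p HEAD
```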
To store annotated tags, Git uses the fourth object type, the tag object, which also stores a hash ID: the annotated tag's target object. Hence annotated tag objects point to commits (well, in general Git allows each annotated tag object to point to any object type, though commits are the norm).
Because of all of this encoding, any Git object is either reachable from some other Git object (by traversing the commit graph from some appropriate starting point), or not reachable. Given a complete list of all Git object hash IDs, any breadth-first or depth-first search algorithm can be used to find which commits are reachable from some starting point(s), and which are not.
This is where we put this object database together with the reference-name/hash-ID-value database. All of these reference names, with their corresponding hash ID values, are the entry points into the graph. The commits that we can reach, from these starting-point hashes, are the commits that Git will retain. Any tags or commits that cannot be reached from these reference names become eligible for garbage collection. Any trees and blobs that lose all their references, through garbage-collection of the tag and/or commit objects, are also up for disposal.
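One way to see this in action is to create a commit that no reference reaches and ask `git fsck` about it. A sketch, using the plumbing command `git commit-tree`, which writes a commit object without moving any reference (the empty tree from `git mktree` serves as its snapshot):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "root"

# An empty tree, plus a commit on it that no ref points to.
tree=$(git mktree </dev/null)
orphan=$(git -c user.name=me -c user.email=me@example.com \
    commit-tree -m "unreferenced" "$tree")

# Reports: unreachable commit <hash>
git fsck --unreachable
```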
Hence, if we have a graph that looks like this:
o <--o <--o   <-- refs/start1
 \
  o <--o   <-- refs/start2
   \
    *
then we start at the two starting points, mark those commits reachable, mark their parents (to the left) reachable, and mark their grandparent reachable. The grandparent has no parents (it's a root commit) so the process stops; and all the commits we didn't mark are unreachable and can be discarded. In this case, that's just the one commit `*`.
There's no need for the graph to be completely connected like this. We can have two disjoint subgraphs:
o--o--o <-- refs/start1
o--o--o <-- refs/start2
All of these commits are also referenced, and hence are safe from collection.
When you run `git push`, your Git calls up another Git and offers it, not just the name/value pairs, but also the commits (and other objects) identified by those name/value pairs. The receiving Git will ask for those commits if it does not have them yet, and then view their parent IDs and ask for those commits as well if necessary, and so on. The receiving Git will also ask for the additional tree and blob objects needed to make the graph complete. So if you have two independent Git repositories that have different names in them, and `git push` from each, you'll get the kind of disjoint subgraphs we see above.
But when you run `git clone`, the cloning Git asks the sending Git for (normally) all of its name/value pairs, and then all of the commits (and other objects) that are reachable from those values. So the one doing the clone gets all the disjoint subgraphs.
You can set up a Git repository so that it doesn't ask for all name/value pairs. This repository will, on `git fetch` (not `git pull`; that's just `git fetch` followed by a second Git command, so `git fetch` is the interesting part), only take some of the name/value pairs, and hence only take some of the commits and other objects. That would let you extract some or all of an independent sub-graph of the GitHub repository.
What you gain from doing this, vs. using two different GitHub repositories (one for each sub-graph), is a lot of headaches when cloning and pushing: you cannot use a straight `git clone` or you will get everything, and when you push from separated repositories, you must be very careful to avoid reference-name collisions. So the end effect is that you make things hard for yourself, for essentially zero benefit.
Upvotes: 3
Reputation: 45649
It can be done; but it may not really gain you much vs. just maintaining two remote repos.
Basically you'd have a branch (or set of branches) for each distinct set of content. You could then use refspecs to control which (set of) branch(es) to map in a given repo.
So for example, you could create a repo like this
git init
git checkout --orphan content-A/master
# stage content for the "A" clones
git commit -m "Init A"
git checkout --orphan content-B/master
# stage content for the "B" clones
git commit -m "Init B"
Now you can push that to a github repo, and anyone can clone it and then choose which set of branches to check out.
If you want to get fancy, and ensure that an "A clone" doesn't have the B history in its database at all (i.e. if each set of content with history would take up a lot of disk space and you want to only keep what you need), that can also be done. You have to get around git's tendency to copy the entire repo. One way would be
git clone --single-branch --branch=content-A/master url/of/remote/repo
This sets up a suitable refspec for pulling just that one branch. If each `content-x/` namespace contains multiple branches, you have to tweak the configuration a little; something like
git config remote.origin.fetch '+refs/heads/content-A/*:refs/remotes/origin/content-A/*'
Upvotes: 3