Reputation: 2666
I'm a graduate student experimenting with using Github for science. Often I might perform different tasks for the same project on different servers. I don't always want all of these functions on the same server at the same time, but it would be nice to keep them organized in one remote location.
Is it possible to `git push` two directories with different content to the same origin/GitHub repo? Basically, in this setup I would almost never pull content; just push it to keep myself organized and potentially share with collaborators. Or maybe use `sparse-checkout` to occasionally clone individual files/subdirectories.
Upvotes: 3
Views: 2033
Reputation: 488193
It is possible.
It's not particularly useful, though you can make it useful by careful discipline.
Git is fundamentally a two-part system: there's an object database, which is a simple key/value store; and there is a naming system, which is a second simple key/value store. The keys in the naming system are references, which are strings like `refs/heads/master`, `refs/heads/branch`, `refs/tags/v1.2`, and so on, and their values are hash IDs. The keys in the object database are hash IDs, and their values are Git objects.
It's the fact that the naming system uses arbitrary keys (well, arbitrary except for the leading `refs/` requirement) that allows you to do what you're suggesting. It's the way that these keys' values work, and the way that the object database key/value system works, that makes it not so useful.
If you imagine the name keys as a simple flat table, you have:
name value
------------------------------
refs/heads/master 1234567...
refs/heads/branch fedcba9...
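You can dump this name/value table directly with `git for-each-ref`. A minimal sketch in a throwaway repository (the identity `me@example.com` is a placeholder, and your default branch may be `main` rather than `master`):

```shell
set -e

# Build a throwaway repo with two branch names pointing at one commit.
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "root"
git branch branch

# Prints one "<hash> commit <name>" line per reference,
# i.e. the name/value table from the text.
git for-each-ref
```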
When you run `git push`, the final part of the `git push` operation is to deliver some name/value pairs to the server (on GitHub in this case) and ask them to set those name/value pairs in this table. They either accept the request to set those up, or refuse it (on a case-by-case basis, i.e., if you deliver two name/value pairs, they may accept one and refuse the other).
It's the hash-ID/object part of the database that is complicated. Each hash ID is unique to its particular object. There are four object types: commit, tree, tag, and blob. All share a common header encoding (the type name in ASCII, an ASCII blank, the object's size encoded as a decimal number with no leading zeros, and an all-zero-bits byte) followed by type-specific data. (The actual hash ID is just an SHA-1 hash of this header plus the type-specific data bytes.)
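This header-plus-data hashing can be reproduced with nothing but `printf` and `sha1sum` (assuming GNU coreutils). A sketch, using the six-byte content `hello\n`:

```shell
# Hash the loose-object header ("blob", space, decimal size, NUL byte)
# followed by the content, exactly as Git does.
hash=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)
echo "$hash"
```

The result matches what `git hash-object` reports for the same content.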
A blob object represents raw data, typically a file, so it just consists of raw bytes after the header. Two files with the same size and content have the same hash, i.e., are the same blob, so that if there is a file `README` and then a second file `README~` that has the same content, there's only one repository blob for both files.
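A quick sketch of that deduplication, using `git hash-object` in a throwaway repository (the file names just mirror the example above):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q

# Two files, identical content.
printf 'same content\n' > README
printf 'same content\n' > README~

# Both files hash to the same blob ID, so the repository
# would store the data only once.
a=$(git hash-object README)
b=$(git hash-object README~)
[ "$a" = "$b" ] && echo "one blob for both files"
```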
For Git to store files by name, it needs to store file names as well, so it does this with tree objects. A tree object has a special, well-known internal format, which although it is binary, is essentially what you see when you run `git ls-tree`: for each file in a tree or sub-tree, Git stores a mode, a Git object hash ID, and a file name. (The object type is found by reading the object.)
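A minimal sketch of that mode / hash / name layout, in a throwaway repository (file name and identity are made up for illustration; `git ls-tree` also prints the object type between the mode and the hash):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q

printf 'hello\n' > greeting.txt
git add greeting.txt
git -c user.name=me -c user.email=me@example.com \
    commit -q -m "add greeting"

# One line per entry: <mode> <type> <hash>	<name>
git ls-tree HEAD
```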
For Git to store a snapshot in a commit, it needs to store a tree object that goes with that commit. So a commit object has, as part of its object data, that tree's hash ID. Each commit also records its parent hash ID (or IDs, for a merge), so that the commit objects form a Directed Acyclic Graph: the parent hash IDs act as the outbound edges E of the vertex V represented by the commit, and G = (V, E) is the commit DAG. Within this commit DAG, each commit points to one tree, which points to sub-trees and/or blobs, so Git can walk the commit DAG to find commits, use each commit's tree to obtain file names and blob IDs, and extract the files for that commit.
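You can see the tree and parent hash IDs stored inside a commit with `git cat-file -p`. A sketch in a throwaway repository (two empty commits, so the second one has a parent):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "root"
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "child"

# Prints "tree <hash>", "parent <hash>", author, committer, message.
git cat-file -p HEAD
```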
To store annotated tags, Git uses the fourth object type, the tag object, which also stores a hash ID: the annotated tag's target object. Hence annotated tag objects point to commits (well, in general Git allows each annotated tag object to point to any object type, though commits are the norm).
Because of all of this encoding, any Git object is either reachable from some other Git object (by traversing the commit graph from some appropriate starting point), or not reachable. Given a complete list of all Git object hash IDs, any breadth-first or depth-first search algorithm can be used to find which commits are reachable from some starting point(s), and which are not.
This is where we put this object database together with the reference-name/hash-ID-value database. All of these reference names, with their corresponding hash ID values, are the entry points into the graph. The commits that we can reach, from these starting-point hashes, are the commits that Git will retain. Any tags or commits that cannot be reached from these reference names become eligible for garbage collection. Any trees and blobs that lose all their references, through garbage-collection of the tag and/or commit objects, are also up for disposal.
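One way to see this in action is to create a commit that no reference reaches and ask `git fsck` about it. A sketch, using the plumbing command `git commit-tree`, which writes a commit object without moving any reference (the empty tree from `git mktree` serves as its snapshot):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "root"

# An empty tree, plus a commit on it that no ref points to.
tree=$(git mktree </dev/null)
orphan=$(git -c user.name=me -c user.email=me@example.com \
    commit-tree -m "unreferenced" "$tree")

# Reports: unreachable commit <hash>
git fsck --unreachable
```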
Hence, if we have a graph that looks like this:
o <--o <--o   <-- refs/start1
 \
  o <--o   <-- refs/start2
   \
    *
then we start at the two starting points, mark those commits reachable, mark their parents (to the left) reachable, and mark their grandparent reachable. The grandparent has no parents (it's a root commit) so the process stops; and all the commits we didn't mark are unreachable and can be discarded. In this case, that's just the one commit `*`.
There's no need for the graph to be completely connected like this. We can have two disjoint subgraphs:
o--o--o <-- refs/start1
o--o--o <-- refs/start2
All of these commits are also referenced, and hence are safe from collection.
When you run `git push`, your Git calls up another Git and offers it, not just the name/value pairs, but also the commits (and other objects) identified by those name/value pairs. The receiving Git will ask for those commits if it does not have them yet, and then view their parent IDs and ask for those commits as well if necessary, and so on. The receiving Git will also ask for the additional tree and blob objects needed to make the graph complete. So if you have two independent Git repositories that have different names in them, and `git push` from each, you'll get the kind of disjoint subgraphs we see above.
But when you run `git clone`, the cloning Git asks the sending Git for (normally) all of its name/value pairs, and then all of the commits (and other objects) that are reachable from those values. So the one doing the clone gets all the disjoint subgraphs.
You can set up a Git repository so that it doesn't ask for all name/value pairs. This repository will, on `git fetch` (not `git pull`; that's just `git fetch` followed by a second Git command, so `git fetch` is the interesting part), only take some of the name/value pairs, and hence only take some of the commits and other objects. That would let you extract some or all of an independent sub-graph of the GitHub repository.
What you gain from doing this, vs. using two different GitHub repositories (one for each sub-graph), is a lot of headaches when cloning and pushing: you cannot use a straight `git clone` or you will get everything, and when you push from separated repositories, you must be very careful to avoid reference-name collisions. So the end effect is that you make things hard for yourself, for essentially zero benefit.
Upvotes: 3
Reputation: 45649
It can be done; but it may not really gain you much vs. just maintaining two remote repos.
Basically you'd have a branch (or set of branches) for each distinct set of content. You could then use refspecs to control which (set of) branch(es) to map in a given repo.
So for example, you could create a repo like this
git init
git checkout --orphan content-A/master
# stage content for the "A" clones
git commit -m "Init A"
git checkout --orphan content-B/master
# stage content for the "B" clones
git commit -m "Init B"
Now you can push that to a github repo, and anyone can clone it and then choose which set of branches to check out.
If you want to get fancy, and ensure that an "A clone" doesn't have the B history in its database at all (i.e. if each set of content with history would take up a lot of disk space and you want to only keep what you need), that can also be done. You have to get around git's tendency to copy the entire repo. One way would be
git clone --single-branch --branch=content-A/master url/of/remote/repo
This sets up a suitable refspec for pulling just that one branch. If each `content-x/` namespace contains multiple branches, you have to tweak the configuration a little; something like
git config remote.origin.fetch '+refs/heads/content-A/*:refs/remotes/origin/content-A/*'
Upvotes: 3