git: file history across forks

Question

My company recently migrated from ClearCase to Git. We are wondering whether there is a way to display file history across forked repos. Our development consists of a core product repo and a series of 'project forks'. Maintainers and developers alike would like the ability to see individual file changes from all the forks of a particular repo (the parent). This would be a bit like the 'Version tree' feature in CC. We have many forks, and it would not be practical for each dev to clone them all and then search each clone. We use Bitbucket 5.3.

torek · Accepted Answer

Git does not have "file history". Git has commits, and commits are history, because each commit has some set of parent commits. When we—or Git—link a commit to its parent(s), and then use its parent(s) to link to more parent(s), we get a graph:

A <-B <-C   <--master
     \
      D   <--develop

Here, the name master selects commit C, whose parent is commit B. The name develop selects commit D, whose parent is also commit B. Commit B has commit A as its parent, and because this repository is so tiny, commit A is the very first commit and has no parent at all.

Therefore, the history in this repository is that C leads to B which leads to A, and D leads to B which we already saw. That is the history. That's all there is ... except ...

Well, we know that each commit is a snapshot of all files. So commit C has some set of files in it. Commit B also has a snapshot of all files. If the set of files in commit C matches the set in commit B, except for one specific file such as README that's different, why then, we can synthesize a history for file README: it was changed in C with respect to its parent B.

The file is just there, in C, and in B, but it's different in those two commits. The two commits are linked—B is C's parent (which implies that C is B's child, though we have to compute that as Git stores only the backwards links). So this allows us to synthesize a file history: we look at the commit history, and extract information about the file we care about. If it changed from one child back up to its parent, we can claim that this is an interesting change, recording the child and parent IDs (and if the child has just one parent, we don't even need to record the parent ID).
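The synthesis described above can be sketched with a toy model. This is not how Git stores objects: the file names, the version strings, and the `file_history` helper are all invented for illustration, and comparing a merge against each of its parents is a simplification of what `git log -- <path>` really does.

```python
# Toy model of the four-commit graph above. File contents are invented.
COMMITS = {
    "A": {"parents": [], "files": {"README": "v1", "main.c": "v1"}},
    "B": {"parents": ["A"], "files": {"README": "v1", "main.c": "v2"}},
    "C": {"parents": ["B"], "files": {"README": "v2", "main.c": "v2"}},
    "D": {"parents": ["B"], "files": {"README": "v1", "main.c": "v3"}},
}
BRANCHES = {"master": "C", "develop": "D"}


def file_history(commits, tips, path):
    """Return the IDs of commits in which `path` differs from a parent."""
    interesting = []
    seen = set()
    todo = list(tips)
    while todo:
        commit_id = todo.pop()
        if commit_id in seen:
            continue  # B is reachable from both C and D; visit it once
        seen.add(commit_id)
        commit = commits[commit_id]
        version = commit["files"].get(path)
        if commit["parents"]:
            changed = any(
                commits[p]["files"].get(path) != version
                for p in commit["parents"]
            )
        else:
            changed = version is not None  # root commit: the file was added
        if changed:
            interesting.append(commit_id)
        todo.extend(commit["parents"])  # follow the backwards links
    return sorted(interesting)


print(file_history(COMMITS, BRANCHES.values(), "README"))  # ['A', 'C']
print(file_history(COMMITS, BRANCHES.values(), "main.c"))  # ['A', 'B', 'D']
```

The walk has to start from every branch tip: starting from master alone would never visit D.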

Adding a new commit just adds more history

Suppose we have the above four-commit repository, and we clone it so that we can work on it. In our clone we check out branch develop, so that commit D is the current commit and our HEAD is attached to the name develop:

A--B--C   <-- master
    \
     D   <-- develop (HEAD)

We now edit file README, run git add README, and run git commit. Our Git makes a new commit E, which acquires a new, unique, big ugly hash ID, and stores that in our repository. It sets the parent of E to D—the commit that was current when we ran git commit—and then stores E's hash ID in our name develop, giving us:

A--B--C   <-- master
    \
     D--E   <-- develop (HEAD)

Our name develop now points to commit E rather than commit D.
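As a rough sketch of that bookkeeping, here is a toy `make_commit` function. The repository layout (two dictionaries) and the way the ID is derived are assumptions made for the example; Git's real hash covers different data, but the two effects are the same: the new commit records the previous tip as its parent, and the branch name moves.

```python
import hashlib


def make_commit(repo, branch, files, message):
    """Record a snapshot on `branch` and move the branch name to it."""
    parent = repo["branches"].get(branch)  # the commit that is current now
    payload = repr((parent, sorted(files.items()), message)).encode()
    commit_id = hashlib.sha1(payload).hexdigest()  # toy stand-in for Git's hash
    repo["commits"][commit_id] = {
        "parents": [parent] if parent else [],
        "files": dict(files),
    }
    repo["branches"][branch] = commit_id  # the name now selects the new commit
    return commit_id


repo = {"commits": {}, "branches": {}}
d = make_commit(repo, "develop", {"README": "old text"}, "commit D")
e = make_commit(repo, "develop", {"README": "new text"}, "commit E")

print(repo["branches"]["develop"] == e)      # True
print(repo["commits"][e]["parents"] == [d])  # True
```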

If we now wish to examine the history of file README, we must start at both commit E and commit C and work backwards. Did README change from E to D? Yes, so commit E (parent D) is interesting. Did README change from C to B? Yes, so commit C (parent B) is interesting. We must repeat this for every commit in this repository, and that gives us the synthesized history of file README.

About "forks"

A fork is just a clone of some other repository, but one made with some sort of intent, usually to feed changes back to the original repository and/or to pick up changes from the original repository. In order to do that, the fork contains a reference to the original repository, the same way any clone typically contains a reference back to its original repository.

In clones, this reference back to the original has a name, which Git calls a remote. The standard name is origin. Git uses this to automatically pick up new commits from the original repository, using git fetch, and to send new commits made in the clone to the original repository, using git push. The fetch step picks up commits made since the last fetch, or since the original clone. That is, these are commits they have that you don't. A push step gives them commits you have that they don't. Again, we see that what Git has, and cares about, are the commits.

When you use a web server's "fork a repository" web-page button, the server itself records, in some behind-the-scenes way, the fork. Whether and how you can retrieve those records, to find the appropriate server URLs for each such clone, is up to the server. If you can find them all, though, you can simply add each URL to some existing clone, one additional remote name per URL, and then use git fetch --all to fetch from all remotes (all the server clones) into the one clone.
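A minimal sketch of that last step, assuming you have already obtained the fork URLs from the server somehow. The fork names and URLs below are hypothetical placeholders, and `superset_commands` and `run_in_clone` are names made up for this example; the git commands they build (`git remote add` and `git fetch --all`) are the ordinary ones.

```python
import subprocess

# Hypothetical fork names and URLs; substitute whatever your server reports.
FORKS = {
    "forkA": "https://bitbucket.example.com/scm/proja/core.git",
    "forkB": "https://bitbucket.example.com/scm/projb/core.git",
}


def superset_commands(forks):
    """Build the git commands that add one remote per fork, then fetch all."""
    commands = [
        ["git", "remote", "add", name, url]
        for name, url in sorted(forks.items())
    ]
    commands.append(["git", "fetch", "--all"])
    return commands


def run_in_clone(commands, clone_dir):
    """Run each command inside an existing clone, stopping on the first error."""
    for argv in commands:
        subprocess.run(argv, cwd=clone_dir, check=True)


# Print the commands rather than running them, so they can be reviewed first.
for argv in superset_commands(FORKS):
    print(" ".join(argv))
```

Calling `run_in_clone(superset_commands(FORKS), "/path/to/clone")` would execute them; the path is again a placeholder.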

The true name of a commit is its hash ID

In the Git universe, every commit has a unique hash ID, different from every other commit. The way two Git repositories decide whether one of them has a commit that the other doesn't is by comparing hash IDs, so these must be unique. Without going into a lot of detail, this really just works: two commits are the same if their hashes match, and different if not. This means that no matter how long it has been since two repositories diverged, if they shared some set of commits at some point by virtue of one having been cloned from another, they will still share those commits.¹ The histories connect at that point. Commits made after that point in one clone exist only in that clone, though if someone who made a fork (i.e., a clone) somehow transferred one of those commits back to one of the other forks (i.e., some other clone), those two clones will share the pushed-back commit.

¹There is a caveat: this assumes that no one "rewrote history" by copying every commit to a new, different commit with a different hash ID, then stopped using all the originals. In that case, the two repositories no longer share any commits, and the histories will not connect.
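To make "same content, same hash ID" concrete, here is a small sketch of how an object ID is computed in a repository that uses SHA-1 (the long-standing default object format). The author line, message, and parent IDs are invented for the example; the point is only that the ID is a pure function of the commit's content, which includes its parent.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 over a '<kind> <size>' header, a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parent, message):
    # The author and committer lines are invented for this example.
    who = "A U Thor <author@example.com> 1500000000 +0000"
    return (
        f"tree {tree}\nparent {parent}\n"
        f"author {who}\ncommitter {who}\n\n{message}\n"
    ).encode()


tree = git_object_id("tree", b"")  # the ID of an empty tree
same_1 = git_object_id("commit", commit_body(tree, "1" * 40, "edit README"))
same_2 = git_object_id("commit", commit_body(tree, "1" * 40, "edit README"))
other = git_object_id("commit", commit_body(tree, "2" * 40, "edit README"))

print(same_1 == same_2)  # True: identical content, identical ID in any clone
print(same_1 == other)   # False: a different parent makes a different commit
```

This is also why the caveat in the footnote holds: copying a commit onto a different parent necessarily produces a new hash ID.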

Each clone is complete and independent of all other clones, except when you cross-connect them

While it's true that any commit whose hash ID matches is the same in all clones (by definition), each clone is independent of every other clone at all times. By this, we mean that whoever controls that clone can add new commits to it, or change its branch names so that those names refer to different commits. For instance, after we add commit E to our repository, we can remove commit E again, using git reset. If we do that, and don't push our new commit E anywhere, no one else will ever see that we made it. The change we made to README in commit E (with respect to its parent D) vanishes:

A--B--C   <-- master
    \
     D   <-- develop (HEAD)
      \
       E   [abandoned]
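One way to picture "abandoned" is as a reachability question: a commit is visible only if some name leads to it. The sketch below is a toy model in which each commit ID maps to its list of parents; `reachable` is a name made up for this example.

```python
def reachable(commits, branches):
    """Return every commit ID that some branch name leads to."""
    found = set()
    todo = list(branches.values())
    while todo:
        commit_id = todo.pop()
        if commit_id not in found:
            found.add(commit_id)
            todo.extend(commits[commit_id])  # walk back to the parents
    return found


# commit ID -> parent IDs, matching the drawing above
COMMITS = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}

before = {"master": "C", "develop": "E"}
after = {"master": "C", "develop": "D"}  # effect of resetting develop to D

print(sorted(reachable(COMMITS, before)))  # ['A', 'B', 'C', 'D', 'E']
print(sorted(reachable(COMMITS, after)))   # ['A', 'B', 'C', 'D']
```

Commit E is still in the `COMMITS` table after the reset, but no name finds it any more.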

This is a feature of distributed repositories. Commits exist only where they exist: tautological, but true. Once a commit is sent from our clone to some other repository, the commit exists in two repositories. Its unique hash ID is now in both. The only way to get rid of it now is to make sure that both repositories have whatever name(s) find commit E changed so that those names no longer find commit E. Usually we don't try to remove commits unless they have not been given to anyone else, though, because removing them requires a "go to every clone we gave the commit to, and make sure they toss out this commit too" kind of action. Hence, once they are published on a generally-accessible clone such as a web fork, commits tend to be persistent.

Git in general is very sticky-fingered with commits. If we cross-connect our clone with commit E in it (reachable from the name develop) to some other clone, and have the other Git do a fetch, their Git will ask about all the commits that we have that they don't, and will pick up the new commit and give it a name like ourclone/develop. So git fetch-ing from every fork will pick up every commit available in every repository, giving us a sort of super-set clone.
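The effect of such a fetch can be sketched as a set difference plus some renaming. This is a deliberately simplified model: real Git negotiates over the network and transfers only commits reachable from the names being fetched, and the dictionaries and the `fetch` function here are invented for illustration.

```python
def fetch(local, remote, remote_name):
    """Copy commits `local` lacks and record `remote`'s branch tips."""
    new_ids = set(remote["commits"]) - set(local["commits"])
    for commit_id in new_ids:
        local["commits"][commit_id] = remote["commits"][commit_id]
    for branch, tip in remote["branches"].items():
        # Their branch names become remote-tracking names on our side.
        local["remote_tracking"][f"{remote_name}/{branch}"] = tip
    return sorted(new_ids)


ours = {
    "commits": {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]},
    "branches": {"master": "C", "develop": "E"},
}
theirs = {
    "commits": {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]},
    "branches": {"master": "C", "develop": "D"},
    "remote_tracking": {},
}

print(fetch(theirs, ours, "ourclone"))                # ['E']
print(theirs["remote_tracking"]["ourclone/develop"])  # E
print(theirs["branches"]["develop"])                  # D (their own name is untouched)
```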

This super-set clone will let you do what you want

Once we have a super-set that has every fork mushed together, we can have our Git find every interesting commit, starting from every remote-tracking name (forkA/master, forkA/develop, forkB/master, and so on), looking for child commits with a README that differs from their parent commit(s). Since the hash IDs are universal, we can now tell, looking at any one fork, whether that fork has that particular version of that file. But we do have to build a fairly massive combined commit history to see precisely where each instance of that file is in the actual history, which is the set of all commits, because files don't have history.
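In the super-set clone, two ordinary commands cover most of this: `git log --all -- README` lists the commits, reachable from any name, that changed the file, and `git branch -r --contains <hash>` lists the remote-tracking names that have a given commit. The sketch below builds the first command and parses output of the second; the helper names and the sample listing are made up for the example, and the parsing assumes remote names that contain no slash.

```python
import subprocess


def history_command(path):
    """Command listing '<commit> <parents>' for every commit touching `path`."""
    return ["git", "log", "--all", "--format=%H %P", "--", path]


def forks_containing(branch_listing):
    """Extract remote names from `git branch -r --contains <hash>` output."""
    forks = set()
    for line in branch_listing.splitlines():
        name = line.strip()
        if not name or "->" in name:
            continue  # skip blanks and symbolic refs such as origin/HEAD
        forks.add(name.split("/", 1)[0])
    return sorted(forks)


def run_git(argv, clone_dir):
    """Run a git command in the super-set clone and return its output."""
    result = subprocess.run(
        argv, cwd=clone_dir, check=True, capture_output=True, text=True
    )
    return result.stdout


# Hand-written sample of what the branch listing might look like.
sample = """\
  forkA/develop
  forkA/master
  forkB/master
  origin/HEAD -> origin/master
"""
print(forks_containing(sample))  # ['forkA', 'forkB']
```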

Note that you don't actually have to build the super-set clone—but you do need access to every commit, by its hash ID, to see which version of README is in that commit, and see what that commit's parent(s) is/are, and which version of README is in the parent(s). This means you must have access to every fork, and see all commits in that fork. The work you do to build all this information is the same as the work you would do to build the super-set clone, so you might as well build the super-set clone.