Bruce Adams

Reputation: 5581

git: rewriting, moving and tracking history safely

I forked a repository to which I wanted to add content at a higher level, so I googled how to move the contents of a repo down a level. The standard recommendation seems to be to use filter-branch. For example, one of the answers was:

How can I rewrite history so that all files, except the ones I already moved, are in a subdirectory?

There is perhaps better advice here:

Move file and directory into a sub-directory along with commit history

To my mind, applying git mv to each file in the repository makes more sense. There isn't a built-in way to do this recursively, so you have to do something like:

# Move every tracked file under subdir/ (bash; git ls-files never lists .git internals)
git ls-files -z | while IFS= read -r -d '' file; do
    mkdir -p "subdir/$(dirname "$file")"
    git mv -- "$file" "subdir/$file"
done

I am wary of changes that rewrite history as opposed to repairing it. I would expect any rewriting of history with filter-branch to cause problems if you try to fetch from (sync with) the original upstream project. Hopefully git will be able to understand (and merge) commits based on moves better.

If that's the case why is filter-branch recommended more often (my perception - which could be wrong) and why haven't more powerful variants of git mv (e.g. recursive) bubbled to the surface yet? (Q1)

I understand using filter-branch for example to delete sensitive data like passwords for all versions of the repository. However, it seems like a very bad practice to recommend hiding major (deliberate rather than accidental) changes to the repository.

Are there recommended equivalents to filter-branch (or other best practices) that do allow history to be tracked for deliberate changes? (Q2)

Clarification: The history does not have to be attached to the same entity (file) but it must be trackable e.g. using git log --follow.

Upvotes: 2

Views: 398

Answers (1)

torek

Reputation: 487983

What you're asking for is technically impossible in Git. The reason is simple enough, although rather self-entangled:

  • There are four kinds of objects in the repository: commits, trees, blobs (files), and annotated tags. Each object has a unique identifier, represented as a 40-character SHA-1 hash, such as 7c56b20857837de401f79db236651a1bd886fbbb. [1] The repository is basically a key/value store, with the hash ID being the key and the contents of the object being the value.

    The unique ID depends entirely on the contents of the object, and in fact, is formed by hashing the object (prefixed with a tiny header giving the object's type and size). This means, for instance, that the hash of a file containing one line with just the word hello is ce013625030ba8dba906f756967f9e9ca394464a. Every file in the universe that consists of that one line has that one same hash. [2]

    In other words, the uniqueness of the hash depends on the uniqueness of the object. Use the same object again and you get the same hash. Use a different object, and you get a different hash. At the bottom level, Git is simply this same key/value store: give it a key (which you must somehow, magically, know), and it gives you back a value whose hash is that key.

  • A commit object records five items as its value:

    1. A single tree ID: the source tree for that commit. (The tree itself is also stored as an object, but multiple commits can re-use one tree. For instance, if you make a commit, then immediately make a revert-commit for that commit, the revert commit reuses the original tree. That is, we start with tree T1; we make a new commit with tree T2; then we make the revert commit and it has tree T1 again. It's a different commit but it stores the same source tree.)
    2. A list of parents (hash IDs of each parent of that commit). The list can be empty, representing a root commit; have one ID, being a regular commit; or have two or more IDs, making this a merge commit.
    3. An author: a full name, an email address, and time-stamp giving who wrote the commit and when.
    4. A committer. This is the same idea as the author, but allows one person to use another's code and give both credit. The committer is the jerk who put the author's terrible code into the repository. :-) (The idea here is to allow for emailed patches, and cross-repository pulls upon pull-request.)
    5. The commit message. Git itself does not enforce any format for this, although the short-subject followed by longer-body is the recommended standard and git log has tools to extract these parts.
  • Last, the clincher: History is commits.

The history within a repository is the set of commits in the repository: nothing more, nothing less. The way you see history is to start with a reference, such as a branch name or a tag, which is simply a way Git provides you to turn a human-friendly string like master or v2.2.1 into a hash ID. That gets you the last (or tip) commit. The tip commit has one or more stored parent IDs, which get you the next bit of the history, and those commits have more parents, and this lets you move backwards through the history.
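Stepping back to the key/value store in the first bullet: the claim that an ID is nothing more than a hash of the content can be checked without Git at all. The following is a minimal sketch, assuming a POSIX shell with `sha1sum` (GNU coreutils) on the path. The function name `blob_id` is made up for this illustration and is not a Git command.

```shell
#!/bin/sh
# blob_id is a hypothetical helper, not part of Git: it reproduces the ID
# Git gives a blob by hashing the header "blob <size>\0" plus the content.
blob_id() {
    size=$(printf '%s' "$1" | wc -c | tr -d ' ')
    { printf 'blob %s\000' "$size"; printf '%s' "$1"; } | sha1sum | cut -d' ' -f1
}

# One line containing just the word "hello" (the newline is part of the file).
hello_id=$(blob_id 'hello
')
# The same content again, and then slightly different content.
again_id=$(blob_id 'hello
')
other_id=$(blob_id 'hello!
')

echo "hello: $hello_id"
echo "again: $again_id"
echo "other: $other_id"
```

If Git is installed, `echo hello | git hash-object --stdin` should print the same value as the first line.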

Since the parent and tree lines are part of a commit object, if you ever want to make any change to any commit, anywhere in the history stored in a repository, you must make a new and different commit. Even if you preserve the author and committer name+email+timestamps exactly, even if you preserve the exact message, if you've changed the tree in any way, you get a new, different commit, with a new, different hash ID.

Then, since you've made a new commit that belongs somewhere back along a chain of commits, you must re-copy every subsequent commit. You must re-copy the child [3] of the commit in order to put in a new parent line. This produces a new, different hash for the child, so now you must re-copy its child, which is your commit's grandchild. This forces you to re-copy the great-grand-child, and so on, all the way to the tip.
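The effect of the parent line on the commit ID can be seen with Git's plumbing commands. This is only a sketch in a throwaway repository: the identity, date, file name and messages are placeholders chosen for the example, and the timestamps are pinned so that the parent is the only input that varies.

```shell
#!/bin/sh
set -e
# Throwaway repository; every name, date and message here is illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Pin identity and timestamps so commit contents are fully determined.
export GIT_AUTHOR_NAME=A GIT_AUTHOR_EMAIL=a@example.com
export GIT_COMMITTER_NAME=A GIT_COMMITTER_EMAIL=a@example.com
export GIT_AUTHOR_DATE='2000-01-01 00:00:00 +0000'
export GIT_COMMITTER_DATE='2000-01-01 00:00:00 +0000'

echo hello > file.txt
git add file.txt
tree=$(git write-tree)

# Two root commits from identical input, and one with a different message.
root1=$(git commit-tree -m root "$tree")
root2=$(git commit-tree -m root "$tree")
other=$(git commit-tree -m 'other root' "$tree")

# Same tree, same message, same author and dates: only the parent differs.
child_a=$(git commit-tree -m child -p "$root1" "$tree")
child_b=$(git commit-tree -m child -p "$other" "$tree")

echo "roots:    $root1 $root2 $other"
echo "children: $child_a $child_b"
```

The two identical roots should come out as one and the same object, while the two children get different IDs even though they store the same tree. That is the re-copying cascade in miniature.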


[1] This is actually the tag v2.2.1 in the Git repository for Git itself. It's theoretically possible that this same ID will be assigned to another, different Git object, somewhere in the universe, as long as that different Git object is never used in a clone of the Git repository for Git. In general, though, no ID ever repeats anywhere unless the contents are bit-for-bit identical; and it's quite critical that no ID ever repeat like that within a single repository. In fact, Git literally can't store two different objects in one repo using the same ID.

[2] If Git ever changes hash algorithms, this is going to cause a fair bit of pain. Mercurial also uses SHA-1, but Mercurial deliberately left room to switch to SHA-256, and is much better at hiding the internal hashes. Git exposes the hashes too easily, and tree objects have no space for larger hashes, so the transition will be more disruptive.

[3] Or children, if the commit has more than one child. Note that finding children is hard, as commits only record their parents. Git must traverse the entire commit graph to find all children of a given commit. Usually, it doesn't even bother: most cases don't need to, and some cases that seem to need to, can get away with just finding a subset of children. Git has an unfortunate tendency to leave it to you, the user, to figure out when it really matters that you find all children, and to make you force Git to do that.


So what does git filter-branch do then?

The answer is simple enough: git filter-branch copies commits.

The filter-branch script does its best to preserve original commits as bit-for-bit identical copies. If it can copy a commit this exactly, then the new copy has the same ID as the original and thus is the original. But if anything has changed—in the tree, or in terms of a parent ID—then the new copy has a new, different ID.

Filter-branch does this copying by first listing every ID to copy into a file. Then it goes through this file, in "parents before children" order. It extracts the commit to be copied, applies all your filters, and makes a new commit from the result. If the new commit is bit-for-bit identical, the "new" commit simply shares the old one; otherwise it has a new, different ID.

The filter-branch command also makes up a mapping file: "old ID was X, new ID is Y". Each new commit simply adds a new mapping: X and Y are equal if the commit was in fact bit-for-bit identical, otherwise they are different. And, of course, you can skip some commits (using the --commit-filter argument), which makes the skipped commit map to the most recently not-skipped commit: this is the "remap to ancestor" concept that shows up in the documentation.

When filter-branch finishes, it rewrites some or all of the references (branch names and optionally tag names as well—it probably should default to including tags, really) using the accumulated mappings.

Note that after filtering, you have both sets of history in your repository: the original commits, saved in refs/original/refs/heads/master for instance, and the new copies as pointed to by the rewritten refs/heads/master for the master branch.
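To see both sets of history side by side, here is a small sketch that runs filter-branch on a throwaway repository. The file names, the subdir name and the filter command are assumptions made for the example (a real repository would need a filter that copes with arbitrary file names), and `FILTER_BRANCH_SQUELCH_WARNING` only matters on newer Git versions that print a deprecation notice.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=A GIT_AUTHOR_EMAIL=a@example.com
export GIT_COMMITTER_NAME=A GIT_COMMITTER_EMAIL=a@example.com
# Newer Git versions warn and pause before filtering; this silences that.
export FILTER_BRANCH_SQUELCH_WARNING=1

echo one > a.txt; git add a.txt; git commit -q -m first
echo two > b.txt; git add b.txt; git commit -q -m second

branch=$(git symbolic-ref HEAD)    # full ref name of the current branch
old_tip=$(git rev-parse HEAD)

# Copy every commit on the branch, moving the top-level *.txt files
# into subdir/ in each copied commit.
git filter-branch --tree-filter 'mkdir -p subdir && mv *.txt subdir/' HEAD >/dev/null 2>&1

new_tip=$(git rev-parse HEAD)
saved_tip=$(git rev-parse "refs/original/$branch")
new_files=$(git ls-tree -r --name-only HEAD)

echo "old tip:   $old_tip"
echo "new tip:   $new_tip"
echo "saved tip: $saved_tip"
echo "$new_files"
```

The saved ref should still name the original tip, while the branch itself now points at a different commit whose tree has everything under subdir/.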

The uniqueness of IDs and chaining of history provide Git's security

Although Git itself is not meant to be cryptographically secure, note that you can GPG-sign annotated tags. These GPG signatures authenticate only the one specific signed object, i.e., only the tag itself. The tag, however, literally contains the ID of the target commit, so you have in effect certified that the corresponding commit is good and valid, containing no Trojan horses, backdoors, viruses, or other Bad Things™. And, since that commit contains its parent commit ID, you have also signed off on the parent, and its parent, and so on all the way back through history.
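The chain from tag to commit to parent is visible in the raw objects. The sketch below uses an unsigned annotated tag, since signing needs a GPG key; as far as I know, `git tag -s` signs this same content. The tag name, file name and messages are placeholders.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=A GIT_AUTHOR_EMAIL=a@example.com
export GIT_COMMITTER_NAME=A GIT_COMMITTER_EMAIL=a@example.com

echo one > a.txt; git add a.txt; git commit -q -m first
echo two >> a.txt; git commit -q -a -m second

# Annotated (but unsigned) tag pointing at the tip commit.
git tag -a -m release v0.1 HEAD

# The tag object names its target commit; the commit names its parent.
tag_target=$(git cat-file -p v0.1 | sed -n 's/^object //p')
tip_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')

git cat-file -p v0.1
git cat-file -p HEAD
```

Because each of these IDs is a hash over content that includes the next ID down, vouching for the tag object vouches for everything reachable from it.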

When you use filter-branch and have it copy tags, it cuts off the signatures, since they are no longer valid: they point to altered, copied commits. If you want the copies signed, you must do that manually. (This might be why filter-branch does not copy tags by default. The problem is that it throws away the commit ID map file when it's done, so now it's too late: it would be better to copy the tags, removing the signatures in the process, and then let you replace the copies with signed copies.)

(You can also GPG-sign individual commits. This works poorly with filter-branch, and is in general a big nuisance anyway.)

Git's "notes"

While this has little to do with altering commit history, it's a good idea to mention Git's "notes" here. Notes are an alternative solution to the two-part problem that (a) commits are immutable, but (b) we'd like to be able to make a commit, then later mark up that commit in some fashion, e.g., to say that it has passed some automated tests, or been Inspected by Number 42, or whatever.

A "note" is simply a file [4] that gets attached to a commit-ID. This file is stored separately from the commit history, in a "notes history": a chain of commits whose tip is stored in refs/notes/commits (well, refs/notes/ anyway, the commits part is a configurable default and you can have multiple sets of notes). Git has a small set of commands to let you attach a note to a commit, and by default, git log will check each commit, by its hash ID, to see if there are notes for it.

Since the notes are separate files that just refer back to the commit hashes, you can update those files, and therefore update the notes attached to any given commit.
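A short sketch of attaching and then updating a note, again in a throwaway repository with placeholder names and note text:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=A GIT_AUTHOR_EMAIL=a@example.com
export GIT_COMMITTER_NAME=A GIT_COMMITTER_EMAIL=a@example.com

echo one > a.txt; git add a.txt; git commit -q -m first
commit_id=$(git rev-parse HEAD)

# Attach a note to the commit, then read it back.
git notes add -m 'passed automated tests' HEAD
first_note=$(git notes show HEAD)

# -f overwrites the existing note; the commit itself is untouched.
git notes add -f -m 'Inspected by Number 42' HEAD 2>/dev/null
second_note=$(git notes show HEAD)

# The notes live in their own chain of commits under refs/notes/commits.
notes_commits=$(git rev-list --count refs/notes/commits)

echo "commit: $commit_id (now $(git rev-parse HEAD))"
echo "notes:  $first_note -> $second_note"
echo "notes history length: $notes_commits"
```

The commit ID stays the same across both operations; only the separate notes history grows.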

Of course, filtering changes the commit IDs, which loses the linkage between the notes and the commits. It would be possible (albeit nontrivial) for filter-branch to update the notes, but it doesn't do that now.


[4] The "file name" of a commit note is in fact the ID of the commit itself, modified slightly for faster lookup. The modification is similar to the way objects are stored in .git/objects: an object whose hash ID is 12345... gets stored in .git/objects/12/345.... A commit note gets tree-structured, with the tree depth being variable, rather than a simple first-two/all-the-rest fanout, so it's somewhat tricky. The front end git notes interface hides all this pretty well, though.

Upvotes: 2
