Reputation: 408

What causes git to rewrite a commit rather than reusing the existing one?

I've implemented a workflow to manage configuration files as follows:

production is functionally equivalent to what is on the live production server, as the server performs a checkout of production routinely after verifying nothing has changed locally.
pre-production is functionally equivalent to what is on the live pre-production server (see production).
development is effectively equivalent to what is on the live development server (see production).
master is a queue of things ready to merge to production. If nothing is queued, it is pointed at the same commit as production.
Every time a commit is made to master, rebase -p master --no-ff is issued for pre-production and development.
Every time a commit is made to production, rebase -p production --no-ff is issued for master. A tag, unity, is force-updated to this commit point. The commit to production rebases master, and the rebase of master forces both pre-production and development to rebase.
New feature/* branches are always created from the latest unity point (could also be from production, but this is primarily to reduce confusion for users who may accidentally track the production branch this way).

We've been using this workflow in production for a few weeks now and have ironed out most of the kinks. One of the oddities that I've noticed is that some of the merges to pre-production modify the commit being merged, while others don't.

For example:

unity   merge feature/foo to pre-production
|       |
A------>C
 \     /
  \-->B   feature/foo

            unity (merge feature/bar to master, merge master to production)
            |   merge feature/foo to pre-production
            |   |
A---------->D-->E
 \-->B     /   /
  \-------/-->B'
   \---->C

            unity (merge feature/bar to master, merge master to production)
            |   merge feature/foo to pre-production
            |   |   merge feature/baz to pre-production
            |   |   |
A---------->E-->F-->G
 \-->B     /   /   /
  \-------/-->B'  /
   \---->C       /
    \---------->D

            merge feature/bar to master, merge master to production
            |   unity (merge feature/qux to master, merge master to production)
            |   |   merge feature/foo to pre-production
            |   |   |   merge feature/baz to pre-production
            |   |   |   |
A---------->E-->F-->G-->I
 \-->B     /   /   /   /
  \-------/---/-->B'  /
   \---->C   /       /
    \-------/------>D
     \-----H

If I review the pre-production history, this is approximately what I see on a simple scale (some of the branches may have many commits, some may have one or two). I'm also leaving out master, because it's generally at the exact same commit as production, including any "master => production" commits.

What I don't understand is why B' (a duplicate of feature/foo, but not attached to the feature/foo branch) exists, with a modified commit date, while D (feature/baz, both in reality and in merge with pre-production) can exist as-is, through multiple rebase procedures.

If there was a way to force the feature/baz functionality during a rebase, that would be preferred, although it's not really an issue since the whole problem goes away once a branch is moved to production or abandoned/deleted. I'm mostly interested in trying to understand the "why" in how git handles this, and if there's a way to force one path over the other, either way.

Upvotes: 2

Answers (1)

torek

Reputation: 490088

(I'm afraid this is long and not really a direct answer—it's another one of my long blog-post-type answers written between other things.)

Getting to "why" is a little tricky. First, let's look at "what".

Rewrite? Reuse? Neither and both!

In an important and fundamental sense, Git never rewrites commits, and yet in another, it can reuse commits (but not actually rewrite them) by rewriting. This notion is pretty wacky at first blush, and requires explanation. In the end, it ties into when (and why) Git either cannot or must re-use a commit.

Git may copy (some or all of) a commit to a new commit—this is what both filter-branch and rebase do in principle—or it may keep a commit and build additional commits upon that commit, by making new commits that use that commit's ID as their parent IDs (or one of several parent IDs, in the case of a merge). The latter is what normal git commit and git merge do, for instance.

In any case, though, the crucial thing here is that the ID of a commit, or indeed any of Git's objects, is the commit (or the object), in an important and fundamental sense. The ID is a cryptographic hash constructed from the commit's complete contents, and any commit with exactly the same contents has the same ID, and any ID that is the same has exactly the same contents. If you come up with different contents that hash to the same ID as some previous contents, Git simply won't let you store the new contents at all: it will insist that the object already exists, and when you ask for the contents by the generated ID, you'll get the old contents, not the new ones.
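To make the "ID is a hash of the contents" point concrete, here is a minimal Python sketch of how a SHA-1 object ID is derived. It assumes the standard object framing (type, a space, the payload length in bytes, a NUL byte, then the payload). The function name git_object_id is made up for this illustration and is not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Hash an object the way a SHA-1 Git repository does.

    The hashed bytes are "<type> <length>\\0" followed by the raw payload.
    """
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


# Identical contents always give the identical ID, on any machine.
empty_id = git_object_id("blob", b"")
hello_id = git_object_id("blob", b"hello world\n")
hello_again = git_object_id("blob", b"hello world\n")

print(empty_id)   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the well-known empty blob)
print(hello_id)
print(hello_id == hello_again)
```

If Git is installed, `git hash-object --stdin` fed the same bytes should print the same IDs.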

This does mean that Git is, in this same fundamental sense, limited: there are only 2¹⁶⁰ objects that can exist in any Git repository,¹ and once you have stored all of them, no new object can ever go in. Fortunately, this number is so enormous that it's reasonably safe to assume that you not only will never fill it up, but in fact, you will never find two different contents that hash to the same number.

What this means in practice is that Git's object store is append-once-only: At this level, you give Git some content (and a type) and ask it to write the object into the repository, using git hash-object -w. Git calculates the hash, and Git then either stores the object and tells you the hash, or does nothing and prints the hash. You then use this hash to retrieve the contents to double check that, in fact, your content got stored (instead of some other content being re-used due to a hash collision), or simply assume that your content did get stored, or was already present.

This latter case is common when storing files, since every commit stores every file. If the first commit had 10 files, and the second commit has the same 10 files but only one was changed, then the second commit re-uses 9 files. (In fact, unless you explicitly git add all ten files again, Git can optimize away even the "pretend to store the 9 reused files" step. But if you did git add all 10, and only one had changed, then 9 of the 10 blob-object writes simply computed the hash of some existing object, and re-used the object.)
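The "store it, or do nothing and just report the hash" behavior can be modeled with a toy in-memory store. This is only a sketch of the idea, not Git's implementation: the ObjectStore class, its methods, and the file contents are all invented for illustration.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: objects are keyed by their own hash."""

    def __init__(self):
        self._objects = {}

    def write(self, obj_type: str, payload: bytes):
        """Return (object_id, created). Existing objects are never overwritten."""
        data = f"{obj_type} {len(payload)}".encode("ascii") + b"\0" + payload
        oid = hashlib.sha1(data).hexdigest()
        created = oid not in self._objects
        if created:
            self._objects[oid] = data
        return oid, created

    def __len__(self):
        return len(self._objects)


store = ObjectStore()

# First "commit": ten files, all new to the store.
files = {f"file{i}.txt": f"contents of file {i}\n".encode() for i in range(10)}
first_round = [store.write("blob", body)[1] for body in files.values()]

# Second "commit": the same ten files, with only one of them changed.
files["file3.txt"] = b"contents of file 3, edited\n"
second_round = [store.write("blob", body)[1] for body in files.values()]

print(sum(first_round), "blobs written for the first commit")
print(sum(second_round), "blob written for the second commit")
print(len(store), "blobs in the store in total")
```

Only the changed file produces a new object the second time around; the other nine writes just recompute the hash of an object that is already there.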

¹This assumes Git is forever committed to SHA-1, which produces a 160-bit hash digest. Some parts of Git make switching difficult and others make it easy. Mercurial has a similar issue except that its internal format allows for direct switching to a 256 bit hash. Should anyone want something larger (see https://en.wikipedia.org/wiki/Secure_Hash_Algorithm and note that there are 512-bit hashes), Mercurial would also have some difficulty.

What's in a commit?

The second key to understanding this is to look at the actual contents of a real commit. Here is one from the Git repository for Git:

$ git cat-file -p HEAD~2 | sed 's/@/ /'
tree fba3eb43b1cdde5c0201287b16b295fee295b495
parent 930b67ebd7450a72248111582c1955cd6f174519
parent 5cb5fe4ae0f9329843c9b028b45df9c6b987c851
author Junio C Hamano <gitster pobox.com> 1473719678 -0700
committer Junio C Hamano <gitster pobox.com> 1473719678 -0700

Merge branch 'sb/transport-report-missing-submodule-on-stderr'

Message cleanup.

* sb/transport-report-missing-submodule-on-stderr:
  transport: report missing submodule pushes consistently on stderr

I picked a merge here, so that it has two parents instead of the more typical single parent. The important items here are:

tree: there is always exactly one for each commit; this is the hash ID for the top-level tree for the commit. (You can then git cat-file -p that tree object to find its sub-trees and files.)
parent: there is one parent line per parent ID. These give the IDs of the parent commits.
author and committer: there is one line for each, with three parts, giving the person's name and email address, and a time stamp.

These are then followed by a blank line, and then the subject and body of the commit message. Git does not normally interpret the parts after the blank line, nor impose any constraints upon them; the earlier parts have a canonical format, though some versions of Git have been less picky about that as well.²

What this means is that the hash ID of a commit is determined by the tree, the parent ID(s), the author and committer name/email/time values, and the message. If you copy these values, bit for bit, from one commit object, with no changes at all, and then ask Git to hash and write the resulting value, you will get the same object ID, storing the same commit data. It literally is the same commit: just as blob objects get re-used from one commit to another as long as they are bit-for-bit identical, a commit that is bit-for-bit identical to a previous commit does get re-used.

But, if even a single bit is changed, the nature of SHA-1 means that the final hash is wildly different. And, if you make a new commit, even re-using the tree, parent IDs, author name, author email, committer name, and committer email, the new commit will normally have a new, different time stamp, because the time right now is not the same as the time just a second ago. (These time stamp strings count seconds, and are basically Unix time_t values.)

Thus, usually, a new commit has a different ID from every other commit. To get a new commit to truly match an existing commit, you have to keep all the bits the same, including the time stamps. You can do this—the git filter-branch command does it on purpose. But note that this also means that the parent ID(s) must match, bit-for-bit. This means the new commit will re-use any existing parent. Keep this in mind as we move on to git rebase.
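Here is a small Python sketch of that dependency, assuming the commit layout shown in the cat-file output above (tree line, parent lines, author, committer, blank line, message). The commit_id helper, the identity string, and the 40-digit tree and parent IDs are placeholders invented for the example, so the hashes it prints do not correspond to any real commit.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Hash a commit built from its header fields and message (SHA-1 repos)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]       # one line per parent
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Placeholder values, not real objects.
TREE = "a" * 40
PARENTS = ["b" * 40, "c" * 40]                      # a merge: two parents
WHO = "A U Thor <author@example.com>"
MESSAGE = "Merge branch 'topic'\n"

original = commit_id(TREE, PARENTS, f"{WHO} 1473719678 -0700",
                     f"{WHO} 1473719678 -0700", MESSAGE)

# Every bit the same, including both time stamps: the very same commit.
bit_for_bit = commit_id(TREE, PARENTS, f"{WHO} 1473719678 -0700",
                        f"{WHO} 1473719678 -0700", MESSAGE)

# Committer time stamp one second later: a different commit.
one_second_later = commit_id(TREE, PARENTS, f"{WHO} 1473719678 -0700",
                             f"{WHO} 1473719679 -0700", MESSAGE)

# Same everything except the first parent: also a different commit.
new_parent = commit_id(TREE, ["d" * 40, "c" * 40], f"{WHO} 1473719678 -0700",
                       f"{WHO} 1473719678 -0700", MESSAGE)

print(original == bit_for_bit)       # True
print(original == one_second_later)  # False
print(original == new_parent)        # False
```

The last comparison is the one that matters for rebase: a copy placed on a different parent cannot keep the old ID, no matter what else is preserved.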

²We've seen cases where filter-branch will accidentally modify Unicode in the header part, or cause non-newline-terminated final lines in a commit body to become newline terminated, thus changing the commit's hash in a way we did not expect. This change then propagates to every descendant commit through the parent ID lines. But in principle at least, git filter-branch tries not to touch any of this, and leaves any changes to your own filters, so as to preserve commit IDs by preserving commits bit-for-bit.

Rebase copies commits, but usually with something changed

The way rebase works—which is almost the same as the way filter-branch works—is to extract some existing commit, let you make some change(s), and then make a new commit from the result. Most often, there are at least two simultaneous changes:

You start from a different tree (the tree associated with the rebased-so-far branch, or the "onto" commit when doing the first commit). To this tree, you make changes extracted from the commit you're copying: Git does this for you by diffing that commit against its parent, then applying the result of the diff to the tree for the commit you're starting from.
And, you start with a different parent. The new parent for the new copy is the commit the new copy is going after.

If the final tree object is different, or the parent line(s) is/are different, or both, the resulting commit has a new, different hash.

Now, rebase doesn't always actually have to copy commits. Suppose we have the following:

...--B--C--D            <-- main
            \
             E--F--G    <-- topic

If you git checkout topic; git rebase main, Git finds the commits to copy by listing commits reachable from topic (every commit shown here), then subtracting away every commit reachable from main (commits ending in B--C--D). It computes that the target for copying onto is commit D, the tip of main. It must therefore copy E to come immediately after D—i.e., to have D as its parent—and then copy F to come after E, and G to come after F. But E already has D as its parent, so it can do this "copy" by doing nothing at all.
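The selection and the "copy by doing nothing" shortcut can be sketched in a few lines of Python. This is a deliberately simplified model (linear history only, single-letter commit names, a hypothetical plan_rebase function), not how the rebase code is actually written. The extra commit H is an assumption added to show the contrasting case where main has moved on.

```python
def reachable(parents, tip):
    """Return every commit reachable from tip by following parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def plan_rebase(parents, topic_tip, onto, force=False):
    """List (commit, action) pairs, oldest first, for a linear topic branch."""
    to_copy = reachable(parents, topic_tip) - reachable(parents, onto)

    # Linear history assumed: walk first parents from the tip, then reverse.
    order, commit = [], topic_tip
    while commit in to_copy:
        order.append(commit)
        commit = parents[commit][0] if parents[commit] else None
    order.reverse()

    plan, new_parent = [], onto
    for commit in order:
        if not force and parents[commit] == [new_parent]:
            plan.append((commit, "reuse"))   # already sits where the copy would go
            new_parent = commit
        else:
            plan.append((commit, "copy"))    # new parent, hence a new ID
            new_parent = commit + "'"
    return plan


# The graph from the drawing (history before B elided), plus a hypothetical H
# that was added to main after topic forked.
parents = {"B": [], "C": ["B"], "D": ["C"], "E": ["D"], "F": ["E"],
           "G": ["F"], "H": ["D"]}

print(plan_rebase(parents, "G", "D"))              # everything reused
print(plan_rebase(parents, "G", "D", force=True))  # models -f / --no-ff
print(plan_rebase(parents, "G", "H"))              # main moved on: all copied
```

Note that once one commit in the chain has to be copied, every commit after it must be copied too, because each one's parent line now has to name the copy.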

The rebase code is written to do this whenever it can, unless you use -f or --no-ff, in which case it goes ahead with the copy technique anyway. (See https://www.kernel.org/pub/software/scm/git/docs/howto/revert-a-faulty-merge.html for when and why it makes sense to do this.) Because these are copies, they use the new (current) time and get new time stamps.

There is a potential flaw here though: because the time stamps have one-second granularity, if this rebase happens quickly enough—which can occur if many rebases are run from scripts—it may wind up generating a bit-for-bit identical commit. If this happens, the new commit really is the old commit.

Rapid-fire commits

The same thing can affect branches made by scripting when using --allow-empty. Suppose you have a script that does this:

git checkout -b feat1 main
git commit --allow-empty -m 'create branch for feature'
git checkout -b feat2 main
git commit --allow-empty -m 'create branch for feature'

The idea here is to create two new branches forking from main, each with their own (empty) commit:

       E   <-- feat1
      /
...--D     <-- main
      \
       F   <-- feat2

Now you can record, in some external database perhaps, the IDs of commits E and F for whatever later purpose you have for tracking work done on the two feature branches. But if the two new commits, which are made with the same author and committer name and email, are made in the same second, then the two commits both read:

tree 45ee45ee...
parent dddddd...
author A U Thor <auth@thor> 123456789 -0700
committer A U Thor <auth@thor> 123456789 -0700

create branch for feature

These two commits are bit-for-bit identical, and therefore have the same internal commit ID. What we get is not the graph drawn above, but rather this one:

...--D     <-- main
      \
       E   <-- feat1, feat2

(The fix is simple: give them different commit messages, and/or wait one second between commits. This particular problem may seem unlikely but I had it happen to me! Fortunately it was just for a test.)
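The collision and both fixes can be reproduced without touching a real repository. The Python sketch below hashes commit text built from the placeholder values shown above. The commit_id helper is invented for the example, and the tree and parent IDs are padded to 40 hex digits purely for illustration.

```python
import hashlib


def commit_id(tree, parent, ident, timestamp, message):
    """Hash a single-parent commit whose author and committer are identical."""
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {ident} {timestamp} -0700\n"
        f"committer {ident} {timestamp} -0700\n"
        f"\n{message}\n"
    ).encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


TREE = "45ee" * 10          # placeholder tree ID
PARENT = "d" * 40           # placeholder ID for commit D
IDENT = "A U Thor <auth@thor>"
NOW = 123456789             # both commits land in the same second

# What the script effectively does: two commits, same second, same message.
refs = {
    "feat1": commit_id(TREE, PARENT, IDENT, NOW, "create branch for feature"),
    "feat2": commit_id(TREE, PARENT, IDENT, NOW, "create branch for feature"),
}
collided = refs["feat1"] == refs["feat2"]

# Fix 1: give each branch its own commit message.
fixed = {
    name: commit_id(TREE, PARENT, IDENT, NOW, f"create branch for {name}")
    for name in ("feat1", "feat2")
}
distinct = fixed["feat1"] != fixed["feat2"]

# Fix 2: wait a second, so the time stamp differs.
later = commit_id(TREE, PARENT, IDENT, NOW + 1, "create branch for feature")

print(collided)                  # True: one commit, two branch names
print(distinct)                  # True
print(later != refs["feat1"])    # True
```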

Upvotes: 1