Dan Fabulich

Reputation: 39563

git-log missing merge commit that undid a change

Consider this test script.

#!/bin/sh -x

#initialize repository
rm -rf missing-merge-log
mkdir missing-merge-log
cd missing-merge-log
git init

# create files, x, y, and z
echo x > x
echo y > y
echo z > z
git add -A .
git commit -m "initial commit"

# create a branch
git branch branch

# change x and z on master
echo x2 > x
echo z2 > z
git commit -am "changed x to x2, z to z2"
git log master -- x

# change y and z on the branch
git checkout branch
echo y2 > y
echo z3 > z
git commit -am "changed y to y2, z to z3"

# merge master into branch
git merge master
# resolve z conflict
echo z23 > z
git add z
# undo changes to x during merge conflict resolution
# (imagine this was developer error)
git checkout branch -- x
git commit --no-edit

# merge branch into master
git checkout master
git merge branch

# now the x2 commit is entirely missing from the log
git log master -- x

We first create three files, x, y, and z, and create a branch named branch. In master, we commit a change to x and z, and in the branch, we commit a change to y and z.

Then, in the branch, we merge from master, but during merge conflict resolution, we revert the change to x. (For the sake of this example, imagine that this was a developer error; the developer didn't intend to reject the changes to x.)

Finally, back in master, we merge the changes from the branch.

At this point, I would expect git log -- x to show three changes: the initial commit, the change to x on master, and the branch commit that reverted the changes to x.

But instead, at the end of the script, git log just shows the initial commit to x, giving no indication that x had ever been modified! This is using git version 2.22.0.

Why is git log doing this? Are there parameters to git log -- x that would show what happened here? git log --all -- x doesn't help.

(git log --all does show everything, but in real life that would show all changes to all files, including irrelevant changes to y and z, which would be too difficult to wade through.)

Upvotes: 3

Views: 1934

Answers (1)

torek

Reputation: 488183

TL;DR

Use --full-history—but you probably want more options too, so read on.

Long

First, many thanks for the reproducer script! That was very useful here.

Next:

(git log --all does show everything, but in real life that would show all changes to all files, including irrelevant changes to y and z, which would be too difficult to wade through.)

Yes. But it demonstrates that there's no issue with any of the commits; the problem is entirely of git log's making, here. It has to do with the dreaded History Simplification mode, which:

git log master -- x

invokes.

git log without History Simplification

Let me add the output from:

git log --all --decorate --oneline --graph

("git log with help from A DOG"). Since I produced this by running the script myself, the hash IDs will differ from yours (or from anyone else's who does another repro), but the structure is the same, which lets us talk about the commits:

*   cc7285d (HEAD -> master, branch) Merge branch 'master' into branch
|\  
| * ad686b0 changed x to x2, z to z2
* | dcaa916 changed y to y2, z to z3
|/  
* a222cef initial commit

Now, a normal git log, without -- x to inspect file x, does not turn on history simplification. Git starts at the commit you specify—for instance:

git log dcaa916

starts at dcaa916—or at HEAD if you did not specify anything.

In this case, then, git log starts with commit cc7285d. Git shows that commit, then moves on to that commit's parent(s). Here there are two parents—dcaa916 and ad686b0—so Git places both commits into a priority queue. Then it pulls one of the commits from the head of the queue. When I try this, the one it pulls out is dcaa916. (In more realistic graphs, it will by default use the one with the later committer timestamp, but having built this repository with a script, both commits have the same timestamp.) Git shows that commit and places dcaa916's parent a222cef into the queue. For topological sanity, given this particular graph, the commit at the front of the queue is now always going to be ad686b0, so Git shows that commit and then....

Well, now, the parent of ad686b0 is a222cef, but a222cef is already in the queue! This is where that "for topological sanity" thing comes in. By not showing a222cef too early we make sure that we don't accidentally show a222cef twice (among other issues). The queue now has a222cef in it, and nothing else, so git log takes a222cef off the queue, shows a222cef, and puts a222cef's parents in the queue. In this reproducer-example there are no parents, so the queue remains empty, and git log can finish, and that's just what we see with a regular git log. With help from A DOG, we get the graph too, and the one-line output variant.
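To see the parent links that drive this walk, one option is to print each commit next to its parents with the %p format placeholder. The sketch below rebuilds the question's repository in a scratch directory and lists the commits reachable from master. The mktemp location and the demo identity exported through the GIT_* variables are assumptions for illustration, not part of the original script:

```shell
#!/bin/sh
set -e
# Assumed scratch location and demo identity (not in the original script).
repo=$(mktemp -d)
cd "$repo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Rebuild the reproducer from the question.
git init -q
git symbolic-ref HEAD refs/heads/master   # force the branch name used here
echo x > x; echo y > y; echo z > z
git add -A .
git commit -q -m "initial commit"
git branch branch
echo x2 > x; echo z2 > z
git commit -q -am "changed x to x2, z to z2"
git checkout -q branch
echo y2 > y; echo z3 > z
git commit -q -am "changed y to y2, z to z3"
git merge -q master >/dev/null 2>&1 || true   # z conflicts, as in the script
echo z23 > z
git add z
git checkout branch -- x                      # the accidental revert of x
git commit -q --no-edit
git checkout -q master
git merge -q branch

# Each line: abbreviated commit hash, then its abbreviated parent hashes.
walk=$(git log --format='%h %p' master)
echo "$walk"
```

The first line (the merge) should carry two parent hashes, and the last line (the root commit) none.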

git log with History Simplification

Git doesn't have file history. The history in a repository consists of commits. But git log will do its best to show a file history. To do that, it has to synthesize one, and to do that, Git's authors chose to simply omit some subset of commits. The documentation tries to explain that with a short paragraph:

Sometimes you are only interested in parts of the history, for example the commits modifying a particular <path>. But there are two parts of History Simplification, one part is selecting the commits and the other is how to do it, as there are various strategies to simplify the history.

I think this one-paragraph explanation just doesn't work, but I have not yet come up with what I think is the right explanation, either. :-) What they are trying to express here is this:

  • Git isn't going to show you all the commits. It's going to show some selected subset of commits.

    This part makes perfect sense. We already see that even without History Simplification: Git starts with the last commit, the one we specify with a branch name or with HEAD or whatever, and then works backwards, one commit at a time, placing more than one commit at a time into its priority queue if and when necessary.

    With History Simplification, we still walk the commit graph using a priority queue, but for many commits, we're just not going to show the commit. OK so far—but now Git throws in the twist that led them to write that weird paragraph.

  • If Git isn't going to show you all commits, maybe it can cheat and not even bother to follow some forks.

    This is the hard part to express. When we work backwards from branch-tip towards the commit-graph root, every merge commit, where two streams of commits join up, becomes a fork, where two streams of commits diverge. In particular, commit cc7285d is a merge, and when we don't have History Simplification happening, Git always puts both parents into the queue. But when we do have History Simplification happening, Git sometimes doesn't put these commits into the queue.

The really tricky part here is deciding which commits get into the queue, and that's where the documentation's "more detailed explanation" and the TREESAME notion come in. I encourage people to read through it, because it has a lot of good information, but it's very densely packed and is not very good at defining TREESAME in the first place. The documentation puts it this way:

Suppose you specified foo as the <paths>. We shall call commits that modify foo !TREESAME, and the rest TREESAME. (In a diff filtered for foo, they look different and equal, respectively.)

This definition depends on the commit being a non-merge commit!

All commits are snapshots (or more correctly, contain snapshots). So no commit, taken on its own, modifies any file. It just has the file, or doesn't have the file. If it has the file, it has some particular content for the file. To view a commit as a change—as a set of modifications—we need to pick some other commit, extract both commits, and then compare the two. For non-merge commits, there's an obvious commit to use: the parent. Given some chain of commits:

...--F--G--H--...

we'll see what's changed in commit H by extracting both G and H, and comparing them. We'll see what's changed in G by extracting F and G, and comparing them. That's what the TREESAME paragraph here is about: we take F and G, say, and strip out all but the files you asked about. Then we compare the remaining files. Are they the same in the stripped-down F and G? If so, F and G are TREESAME. If not, they're not.

But merge commits have, by definition, at least two parents:

...--K
      \
       M
      /
...--L

If we're at merge commit M, which parent do we pick to determine what's TREESAME and what's not?

Git's answer is to compare the commit to all of the parents, one at a time. Some comparisons may result in "is TREESAME", and others may result in "is not TREESAME". For instance, file foo in M may match file foo in K and/or file foo in L.
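The per-parent comparison can be approximated by hand with git diff --quiet, which exits non-zero when the two commits differ in the given path. This is only a sketch of the idea, not how git log implements it internally. It rebuilds the question's repository in a scratch directory; the mktemp location, the demo identity, and the variable names are assumptions for illustration:

```shell
#!/bin/sh
set -e
# Assumed scratch location and demo identity (not in the original script).
repo=$(mktemp -d)
cd "$repo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Rebuild the reproducer from the question.
git init -q
git symbolic-ref HEAD refs/heads/master
echo x > x; echo y > y; echo z > z
git add -A .
git commit -q -m "initial commit"
git branch branch
echo x2 > x; echo z2 > z
git commit -q -am "changed x to x2, z to z2"
git checkout -q branch
echo y2 > y; echo z3 > z
git commit -q -am "changed y to y2, z to z3"
git merge -q master >/dev/null 2>&1 || true   # z conflicts, as in the script
echo z23 > z
git add z
git checkout branch -- x
git commit -q --no-edit
git checkout -q master
git merge -q branch

# Compare the merge against each of its parents, restricted to path x.
merge=$(git rev-parse master)
result=""
for parent in $(git rev-parse "$merge^@"); do   # ^@ expands to all parents
    if git diff --quiet "$parent" "$merge" -- x; then
        verdict='TREESAME'
    else
        verdict='!TREESAME'
    fi
    echo "$(git rev-parse --short "$parent") vs merge, path x: $verdict"
    result="$result $verdict"
done
```

For this repository the first parent should come out as TREESAME and the second as !TREESAME, which is the situation the default mode described next reacts to.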

Which commits Git uses depends on the additional options you supply to git log:

Default mode

Commits are included if they are not TREESAME to any parent (though this can be changed, see --sparse below). If the commit was a merge, and it was TREESAME to one parent, follow only that parent. (Even if there are several TREESAME parents, follow only one of them.) Otherwise, follow all parents.

So let's consider merge cc7285d, and compare it to each of its (two) parents:

$ git diff --name-status cc7285d^1 cc7285d
M       z
$ git diff --name-status cc7285d^2 cc7285d
M       x
M       y
M       z

This means that git log will walk only the first parent, commit cc7285d^1 (which is dcaa916). That is the parent whose diff against the merge does not list x, so with respect to x the merge is TREESAME to it:

... If the commit was a merge, and it was TREESAME to one parent, follow only that parent. ...

So this git log walks commit cc7285d, then commit dcaa916, then commit a222cef, and then stops. It never looks at commit cc7285d^2 (which is ad686b0) at all.
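As a minimal sketch of the difference, the commands below rebuild the question's repository in a scratch directory and compare the default path-limited log with the --full-history one. The mktemp location and the demo identity exported through the GIT_* variables are assumptions for illustration:

```shell
#!/bin/sh
set -e
# Assumed scratch location and demo identity (not in the original script).
repo=$(mktemp -d)
cd "$repo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Rebuild the reproducer from the question.
git init -q
git symbolic-ref HEAD refs/heads/master
echo x > x; echo y > y; echo z > z
git add -A .
git commit -q -m "initial commit"
git branch branch
echo x2 > x; echo z2 > z
git commit -q -am "changed x to x2, z to z2"
git checkout -q branch
echo y2 > y; echo z3 > z
git commit -q -am "changed y to y2, z to z3"
git merge -q master >/dev/null 2>&1 || true   # z conflicts, as in the script
echo z23 > z
git add z
git checkout branch -- x
git commit -q --no-edit
git checkout -q master
git merge -q branch

# Default mode: the walk is pruned at the merge and follows only parent 1.
default_log=$(git log --oneline master -- x)
default_count=$(git rev-list --count master -- x)

# --full-history: every parent of every merge is walked.
full_log=$(git log --oneline --full-history master -- x)
full_count=$(git rev-list --count --full-history master -- x)

echo "default ($default_count):"
echo "$default_log"
echo "--full-history ($full_count):"
echo "$full_log"
```

On this repository the default log should list only the initial commit, while the --full-history log should also list the x2 commit and the merge that dropped it.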

The rest of this section of the git log documentation describes the options --full-history, --dense, --sparse, and --simplify-merges (and even I don't understand the true purpose of the last option :-) ). Of all of these, --full-history is the most obvious and will do what you want. (--ancestry-path and --simplify-by-decoration are in this section as well, but they don't affect paths at merges.)

Caveats

While --full-history will make sure that Git walks through all "legs" of each merge, git log -p itself by default shows no diffs for merge commits. You must add one of three options—-c, --cc, or -m—to make git log -p show any diff at all for any merge.

If your goal is specifically to find a bad two-parent merge, one that drops some particular change that should have been retained, you probably want to show the diff from that merge to at least one, and perhaps both, of its two parents. The git show command will show a diff for a merge, but its default is --cc style. The git log command, by default, won't show one at all. If you add --cc to your git log, you'll get the same diff that git show would show by default, and that's not going to work either.

The --cc or -c options tell Git that, when looking at a merge commit, Git should diff the commit against all the parents, then produce a summary diff, rather than a detailed one. The contents of the summary exclude parts that match one or all parents. You're looking for a merge that accidentally dropped an important change—a merge that is the same as at least one of its parents, when it should be different from that parent. This combined diff is going to hide the place where the change isn't-but-should-be. So you don't want -c or --cc.

That leaves the -m option. With -m, when git show or git log is going to show a diff, and the commit is a merge commit, Git will show one diff per parent. That is, for a merge commit like M, git show -m will first compare K vs M and show that diff. Then it will compare L vs M and show the other diff. That's the option you want here, for this particular case.

Note that -m combines nicely with --first-parent to show only the full diff against the first parent of each merge. Often that's exactly what you want.

Upvotes: 5
