Reputation: 6824

Commit differences missing / conflict related?

Members of my team recently reported changes in their code (lines removed) that did not show in any commits, while still being active in the code base. Technically, they had code in a feature branch, but that code then went into the final branch with some lines missing.

Using the normal git commands for searching commits (git log -S 'somexpr' ..., and even git log -u and searching the output), I was able to find where the lines were added. But nowhere in the active branch (the one missing the lines) could I find a commit where those lines disappear again.

I'm no git expert, but I've vaguely read something about git not showing merge diffs by default, so I've also experimented with the options -C and --cc. Without success.

Using git blame --reverse HASHWHERELINEEXISTS filename however, I was able to find the hash prefix where the line last appeared. Then using git log and manual search, I was able to find the commit in the log. When examining that commit and the one before it individually, I was still not able to get the diff with the lines disappearing though.

This makes me suspect that maybe those lines disappeared as part of conflict resolution, and that these diffs aren't usually shown anywhere.

I think I finally managed to force git to display the actual diff though (including conflict resolution or whatever). The "trick" was basically to execute git diff HASHFROMREVERSEBLAME..HASHBEFORETHATONE (where those HASH.. values were copied and pasted from the git log output mentioned above).

Which leaves the questions:

Any experts that can explain what's going on, and what the easiest method to locate/search such changes (possibly conflict related) is?
Assuming I was able to locate the correct diff finally, surely git must have some way of searching through such differences? If so, how?

Upvotes: 1

Answers (3)

Marius Kjeldahl

Reputation: 6824

Thanks to everybody that tried to help. I further diagnosed this today, and things are looking better.

In short, the magic command is:

git log -u -m -S searchexpr

The -u is for showing the diffs, the -m is to include "merge diffs". After some regular work and pulling various branches again, this command now seems to include everything as expected.

Without the -m, it only shows the line being added in the first place. With the -m it also shows the line being removed in a later merge.
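A minimal way to reproduce this in a throwaway repository is sketched below. Everything in it is made up for the demo (the branch names master and feature, file.txt, the string needle), and git merge -s ours is only a stand-in for a conflict resolution that dropped the feature branch's line. It uses -p, which is a synonym for -u.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # fix the initial branch name for the demo

echo base > file.txt
git add file.txt && git commit -q -m "base"

# The feature branch adds the line we will later search for.
git checkout -q -b feature
echo needle >> file.txt
git commit -q -am "add needle"

# master moves on independently, so the merge below is a real merge.
git checkout -q master
echo other > other.txt
git add other.txt && git commit -q -m "master work"

# Assumption: simulate a bad resolution by keeping master's files as they are.
git merge -q -s ours -m "bad merge" feature

# Collect the subjects of the commits that -S reports, without and with -m.
without_m=$(git log -p -S needle --format='subject: %s' | grep '^subject: ')
with_m=$(git log -p -m -S needle --format='subject: %s' | grep '^subject: ')
printf 'without -m:\n%s\n\nwith -m:\n%s\n' "$without_m" "$with_m"
```

In this setup, only the -m run lists the merge in which the line went away.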

I had a colleague of mine test the first command on his copy of the branch on his computer, and he says he does not get any output from that command at all. Sounds like me yesterday. So there seem to be some circumstances where git isn't finding our commits properly. When/if I figure out why I will share.

Update: I also checked with some of the other team members, and there were actually some differences in the branch history on the same branch, which may have added to the confusion. I suspect somebody reset the branch at one time and force pushed it upstream, but not everybody on the team picked up the fact that they should reset to upstream. So people have been doing conflict resolution and probably messed it up somewhat. Live and learn, we're already working on better branching strategies and procedures for commits and MRs.

Upvotes: 0

torek

Reputation: 490048

Your diagnosis here is correct:

I've vaguely read something about git not showing merge diffs by default ...

Specifically, git log -p walks the commit graph (see below), but when it hits a merge, just doesn't bother showing a diff, by default. What you want is -m, possibly combined with --first-parent. See the details below.

I've also experimented with the options -C and --cc...

The -C option is irrelevant here (it is passed to the diff engine, where it means "find whole-file copies", which has other uses but isn't any good for your problem). The -c (lowercase) and --cc (two dashes and two lowercase c letters) options are relevant, but not helpful, as we'll see below.

What to know about the commits themselves

In Git, each commit:

contains a snapshot of all the files Git knows about, in a special, read-only, Git-only, compressed and de-duplicated format;
contains some metadata: information about the commit itself; and
is numbered, by the commit's hash ID, which looks random (it isn't, but there's no easy way to predict hash IDs and they're not ordered).

Git looks up these things (commits, and other internal Git objects) by their hash-ID numbers, so you need to provide the commit number to Git to get it to do anything useful. Commit numbers, however, are not useful to humans. So we generally don't use them—well, except in special cases, with cut-and-paste for instance. Instead, we use names. In particular we tend to use branch names like master and remote-tracking names like origin/master. A branch name identifies one specific commit, by holding that commit's number.

The special feature of a branch name is that it always holds the hash ID of the last commit on that branch. That might not seem all that useful—what good is it to know the hash ID of the last commit, without knowing the hash IDs of earlier commits?—until we mention that, in its metadata, each commit stores the commit number of the previous commit. Git calls this the parent of the commit.

What this means is that we can draw the commits, with earlier ones towards the left and later ones towards the right, like this:

... <-F <-G <-H

Here, each uppercase letter stands in for some random-looking hash ID. H in particular stands in for the hash ID of the last commit in the chain. Once we can find commit H, we can use commit H's metadata, which contains the hash ID of commit G, to have Git find commit G. Commit G in turn stores the hash ID of earlier commit F in its metadata, so from G, we can move back to earlier commit F, and so on. All we need is the hash ID of the last commit in the chain—and that's exactly what a branch name holds.

Hence, we can draw this a little more simply as:

...--F--G--H   <-- master

The branch name master gives us the hash ID so that Git can find commit H. We omit the arrows between commits because we know that once a new commit is made, nothing inside it can ever change—not any of the files, nor any of the metadata—and that Git works backwards, from child to parent, to find commits.
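To see the name-to-hash and child-to-parent links for yourself, here is a small sketch in a throwaway repository. The commit messages F, G and H just mirror the drawing; the file name and identity settings are assumptions for the demo.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

# Three ordinary commits in a row: F, then G, then H.
for name in F G H; do
  echo "$name" > file.txt
  git add file.txt && git commit -q -m "$name"
done

# The branch name holds the hash ID of the last commit (H).
tip=$(git rev-parse master)

# H's metadata records the hash ID of its parent (G).
parent_in_metadata=$(git cat-file -p "$tip" | sed -n 's/^parent //p')

# The same parent, found by asking Git to step back one commit.
parent_by_syntax=$(git rev-parse master~1)

git cat-file -p "$tip"      # raw commit object: tree, parent, author, committer, message
echo "H is $tip"
echo "G is $parent_in_metadata"
```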

To make a new commit, Git will:

write out the files (in the special read-only de-duplicated Git format);
write out the appropriate metadata, including the name and email address of whoever is making the commit, and "now" as the date-and-time-stamps—and in this case, H's hash ID as the (single) parent of the new commit—to give us:
```
 ...--F--G--H   <-- master
             \
              I
```
finally, now that commit I exists—the creation of the commit assigns it its new unique hash ID—git commit will make the name master hold its new hash ID:
```
 ...--F--G--H
             \
              I   <-- master
```

There's no reason to keep the kink in the drawing; we could now just write:

...--H--I   <-- master
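The same steps can be watched from the command line. This is only a sketch in a scratch repository, with H and I as made-up commit messages: the name master is read before and after the new commit, and the new commit's parent is compared with the old tip.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

echo H > file.txt
git add file.txt && git commit -q -m "H"
old_tip=$(git rev-parse master)     # master names commit H

echo I >> file.txt
git commit -q -am "I"
new_tip=$(git rev-parse master)     # master now names commit I
parent_of_new=$(git rev-parse master~1)

echo "before: master = $old_tip"
echo "after:  master = $new_tip (parent $parent_of_new)"
```

Commit H itself is untouched; only the name master was updated to hold the new hash ID.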

How `git log` shows you commits

Skipping over a bunch of important details, one of which we'll touch on in a moment, what git log does is to walk backwards through the commits. By starting from a name like master, we—or Git anyway—can find the latest commit, such as commit I. This has a full snapshot of every file that Git knew about when we made commit I.

Next, Git steps back one commit, to H. This of course also has a full snapshot of every file. So Git essentially extracts both commits, and compares all the files in the two commits. This is a form of git diff, as run by git log -p: compare any two commits. Here, the two commits are I (the current one in the walk) and H (its parent).

For files that are the same, the diff code does nothing at all. There's nothing to say about these files. For files that are different, the diff code comes up with some set of changes that would change the left-side (H) commit copy into the right-side (I) commit copy. That's the diff that you see. (For all-new files, or files that got deleted, you see an appropriate recipe here as well. The -C option tells Git: If a right-side file is all-new, see if it's actually wholly or partly copied from some existing file in the left-side commit.)
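As a rough illustration (scratch repository, made-up file contents), the patch that git log -p prints for an ordinary commit carries the same change as an explicit git diff between the parent and the child:

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

printf 'one\ntwo\n' > file.txt
git add file.txt && git commit -q -m "H"

printf 'one\ntwo\nthree\n' > file.txt
git commit -q -am "I"

# Left side: the parent (H). Right side: the child (I).
explicit=$(git diff master~1 master)

# What the log walk shows for commit I.
logged=$(git log -p -1 --format='subject: %s' master)

printf '%s\n' "$explicit"
printf '%s\n' "$logged"
```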

This is fine for these ordinary, simple, single-parent commits, but it doesn't work for merges.

Merge commits are slightly different

When you use git merge to make a real merge, the merge commit:

holds a snapshot, as usual; and
has metadata, as usual, except
in the metadata, the merge commit holds more than one previous-commit number.

It's this last fact, which is really pretty simple, that makes a commit a merge commit.

We can draw a merge commit like this:

...--I--J
         \
          M   <-- branch
         /
...--K--L

This merge commit has two parents. Most merge commits look like this, although Git also supports what they call an octopus merge, where there are three or more parents.

When we hit a merge commit in the commit-walking process, git log has to get more complicated. What it does by default is to walk both legs of the incoming commits, in some order, going from M back to L, but also going from M back to J. You can use the --first-parent flag to tell the graph-walking code to look only at the numerically first parent, which in Git, is the commit from the branch you were on at the time you ran git merge. (The other parent, or parents for an octopus merge, are the other commit or commits that you merged.)
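A small sketch of both points, in a scratch repository whose commit messages borrow the letters from the drawing (the branch and file names are assumptions): git rev-list --parents prints a commit's hash ID followed by its parents' hash IDs, and --first-parent keeps the walk on the branch you were on when merging.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

echo base > file.txt
git add file.txt && git commit -q -m "base"

git checkout -q -b feature
echo feature >> file.txt
git commit -q -am "L (feature work)"

git checkout -q master
echo other > other.txt
git add other.txt && git commit -q -m "J (master work)"

git merge -q --no-ff -m "M (merge)" feature

# One line: the merge's own hash ID, then its first and second parent.
parents=$(git rev-list --parents -n 1 master)

# Default walk follows both legs; --first-parent follows only the first one.
all=$(git log --format=%s master)
first_parent_only=$(git log --first-parent --format=%s master)

echo "$parents"
printf 'all commits:\n%s\n\nfirst parent only:\n%s\n' "$all" "$first_parent_only"
```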

But the git diff code has a problem. You can't really diff the merge commit's snapshot against either parent and get something sensible. At the same time, you can't diff the merge commit against all the parents at the same time ... unless—well, here things get a bit squirrelly.

For git log itself, the solution is simply not to show a diff at all. Unfortunately this solution completely hides an incorrect merge, which is why you can't find the bad merge this way.

For git show, which shows one commit's log message and a patch, the default solution is to use --cc mode, which leads us to the somewhat peculiar definition of a combined diff in Git. The -c option also produces a combined diff, with a slightly different combination method. But both of these are useless for your particular problem because of one special feature of a combined diff.

Git's combined diff omits some files entirely

When Git produces a combined diff, what it does is:

For each parent of the merge-commit child, do a quick diff to find identical files and different files. (Because of the internal storage format, with de-duplication of files, this part is very fast.)
For any file that's entirely the same in the child as any parent's copy of that same file, don't say anything at all.
For every file in the child that's different from that same file in every parent, run a diff against each parent. Then show parts of that diff. (Precisely which parts you see depends on whether you used -c or --cc, but they're both pretty similar.)

Since your case involves someone who accidentally made the merge commit use exactly the same file as one of its parents, instead of taking changes from both parents, a combined diff will, by definition, skip right over that file. So that's what makes combined diffs useless here.
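This can be checked on a deliberately bad merge. The sketch below assumes a scratch repository and uses git merge -s ours as a stand-in for a resolution that kept one parent's files unchanged; the names master, feature, file.txt and needle are invented for the demo.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

echo base > file.txt
git add file.txt && git commit -q -m "base"

git checkout -q -b feature
echo needle >> file.txt
git commit -q -am "add needle"

git checkout -q master
echo other > other.txt
git add other.txt && git commit -q -m "master work"

# Assumption: the merge result is byte-for-byte the first parent's snapshot.
git merge -q -s ours -m "bad merge" feature

# Both combined-diff modes, with the commit header suppressed.
combined_cc=$(git show --cc --format= master)
combined_c=$(git show -c --format= master)

echo "--cc output: [$combined_cc]"
echo "-c output:   [$combined_c]"
```

Because every file in the merge matches the first parent's copy, neither mode has anything to print, even though the feature branch's line was lost.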

The `-m` option

The -m option—which is available for both git log and git show—tells Git to pretend, just for diffing purposes, that one merge commit is N separate commits, where N is the number of parents. That is, given:

...--I--J
         \
          M   <-- branch
         /
...--K--L

the git log command will still cover M, go back to J, and then from J to I, and so on; and will still also cover L, and K, and so on, as needed, in some order. But while showing M itself, Git will pretend that there exist two separate commits that look like this:

J--M1

L--M2

and therefore run two git diff commands, one that compares the snapshot in J to the snapshot in M1, and the other that compares the snapshot in L to that in M2 (with the two "M" snapshots both really being the snapshot in M itself, of course).

In one of these two diffs, the file you care about will not be changed at all. In the other, you'll see that some line that should be carried through—say from L to M2—was changed to match the line in J instead. That shows you the bad commit, and who made it.

Why a lot of this doesn't really matter

Except for educating whoever made the bad commit, the only thing you can do at this point is to make a new commit that has the file corrected. It doesn't matter who makes this corrected commit. All of the previous commits literally cannot be changed. So just make a fix, commit it, and move on.
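One possible shape for such a fix is sketched below, on the same kind of simulated bad merge (scratch repository, invented names). It simply takes the file back from the merged-in side, which is only appropriate under the assumption that the other side made no changes of its own to that file; otherwise you would re-apply the lost lines by hand.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

echo base > file.txt
git add file.txt && git commit -q -m "base"

git checkout -q -b feature
echo needle >> file.txt
git commit -q -am "add needle"

git checkout -q master
echo other > other.txt
git add other.txt && git commit -q -m "master work"

git merge -q -s ours -m "bad merge" feature   # stand-in for the faulty merge

# Take file.txt as it was in the merge's second parent and commit that on top.
git checkout master^2 -- file.txt
git commit -q -m "restore line lost in merge"

git log --format='%h %s'
cat file.txt
```

The bad merge stays in the history; the new commit on top of it carries the corrected file.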

History rewriting

If you like, you can do what people call "history rewriting". Here we take a series of commits:

...--I--J
         \
          M--N--O--P   <-- branch
         /
...--K--L

where there is some problem, say at M, and make a series of new commits:

...--I--J
         \
          M'-N'-O'-P'   <-- replacement-branch
         /
...--K--L

The old commits continue to exist, but since we find them by name, all we have to do now is get everyone to swap the name replacement-branch for the name branch, and vice versa. Then the old (bad) commits will be found under the new name, and the new (good) commits will be found under the old name.

The problem with history rewriting is that we have to convince everyone with a clone of the bad repository to change their branch names. While commits get shared—you connect one Git to another and the Git that doesn't have some commits will, in general, get those commits from the other one—every copy of a Git repository has its own private branch names. So everyone who is using the "wrong" branch has to update their private branch name so that it uses the new and improved commits. It's easy to get those commits, but the default will be to merge the new and improved commits with the old bad ones, which is just what you don't want.

Still, if there's not that much history, and few clones that have the bad commits, the history rewrite trick can be a good option. It's not trivial when there are merges involved, though—and not worth trying to write up here (there are other StackOverflow answers covering this sort of thing).
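For the simplest possible case, where the bad merge is the last commit and nothing has been built on top of it, the name swap could look roughly like this. It is a sketch in a scratch repository: master plays the role of the name branch from the drawings, -s ours simulates the bad merge, and old-master-tmp is a temporary name invented for the swap.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

echo base > file.txt
git add file.txt && git commit -q -m "base"

git checkout -q -b feature
echo needle >> file.txt
git commit -q -am "add needle"

git checkout -q master
echo other > other.txt
git add other.txt && git commit -q -m "master work"

git merge -q -s ours -m "bad merge" feature        # M, the bad merge

# Start over from the first parent of M and redo the merge properly (M').
git checkout -q -b replacement-branch master^1
git merge -q -m "good merge" feature

# Swap the two names: the good commits end up under the old name.
git branch -m master old-master-tmp
git branch -m replacement-branch master
git branch -m old-master-tmp replacement-branch

git log --format='%h %s' master
# Collaborators still have to move their own branch names to the new commits,
# for instance by fetching and resetting to the updated upstream branch.
```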

Upvotes: 3

LeGEC

Reputation: 52206

git log -S <pattern> will not display a commit where the pattern is "moved" from one file to another.

If you have an identified file or directory where this line has disappeared, you can look in the changes that target only this file:

git log -S <pattern> -- this/file
git log -S <pattern> -- this/directory

the computations of -S will now be limited to the diffs that affected this/file or this/directory, instead of the diff of the complete repo;

or use -G <pattern>, which will show diffs where the pattern appears, even if its count didn't change.
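A quick way to compare the two options is sketched here for one specific case, a line moved within a single file, so that its number of occurrences stays the same. The repository, file name and the string needle are made up for the demo.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

# First commit introduces the line at the top of the file.
printf 'needle\nfoo\nbar\n' > file.txt
git add file.txt && git commit -q -m "add needle"

# Second commit moves it to the bottom: one occurrence before, one after.
printf 'foo\nbar\nneedle\n' > file.txt
git commit -q -am "move needle"

# -S looks for a change in the number of occurrences.
by_S=$(git log -S needle --format=%s -- file.txt)

# -G looks for added or removed lines that match the pattern.
by_G=$(git log -G needle --format=%s -- file.txt)

printf 'found by -S:\n%s\n\nfound by -G:\n%s\n' "$by_S" "$by_G"
```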