sixeyes

Reputation: 587

How to diff changes when file reports as new / deleted

I renamed two files and made some changes (in Visual Studio). git status showed the following:

    On branch master
    Your branch is up to date with 'origin/master'.

    Changes to be committed:
      (use "git reset HEAD <file>..." to unstage)

            deleted:    Core/Models/Metadata/MetadataModel.cs
            deleted:    Core/Models/Metadata/MetadataModelCollection.cs
            new file:   Core/Models/Metadata/MetadataValueModel.cs
            new file:   Core/Models/Metadata/MetadataValueModelCollection.cs

If I try git diff --staged it doesn't show the differences between the deleted and new files. Instead it lists all the lines in each file as either deleted or added. Not surprising since git didn't recognise the change as a rename.

How can I diff MetadataModel.cs and MetadataValueModel.cs? Or MetadataModelCollection.cs and MetadataValueModelCollection.cs?

In case it matters I'm using Windows 10 Pro and git version 2.20.1.windows.1

Upvotes: 2

Views: 1479

Answers (1)

torek

Reputation: 489758

TL;DR

You have two choices here. Either make multiple commits, with smaller changes at each step, or use the --find-renames=<percentage> argument (spelled -X find-renames=... for git merge, but --find-renames=... or -M... for git diff) to lower the similarity threshold from the default 50%. Note that there is no knob to do this with git status: git status always uses 50%.

Long

This is fundamentally a question about identity. Philosophically, this is The Ship of Theseus, or the Grandfather's Axe paradox. ("This is my grandfather's axe. My father replaced the handle, and I replaced the head, but it's the same axe. Or is it?")

How do you know that file "old.name" got (1) renamed to "new.ext", and (2) massively changed, between time point A and time point B, so that even though the entire name is different and most of the content is different, we should call it "the same" file? Well, you probably did the rename yourself, so of course you know. :-) But will Bob or Carol know? How? Will Git know?

The answer to the last is no, Git will not know. Git simply does not record this information. Git just makes and uses snapshots. A snapshot either has a file named Core/Models/Metadata/MetadataModel.cs, or doesn't have a file with that name. If the two to-be-compared snapshots both have a file with that name, Git assumes¹ that both files are "the same" file, just with some changed contents. If one snapshot has the file and another doesn't, it's more complicated.

What Git does instead is to (attempt to) detect renames after the fact. Suppose the left side snapshot has Core/Models/Metadata/MetadataModel.cs and the right side snapshot doesn't, but the left-side snapshot does not have Core/Models/Metadata/MetadataValueModel.cs and the right-side snapshot does. That's the case right here, for instance.

In this case, there is some chance that the file was renamed (and maybe modified as well). If you ask Git to do so, Git will compare the contents of all files that are there on the left and not on the right to all files that are there on the right and not on the left. For each such pair of files, Git computes a "similarity score", which it expresses as a percentage: a number between 0% (not at all similar) and 100% (exactly identical). A pair whose score is high enough is treated as a rename.

The 100% identical case is especially useful, because it is guaranteed to work and is extremely quick.² So if you rename a file without changing it at all, and then commit the result immediately, the "before" and "after" commits are nearly identical. They have all the same files, with all the same content, except for one pair of files (or two pairs, or N pairs, if you rename two, or N, files). Git can compare the left-side commit to the right-side commit, see that all files are already paired except for the renamed ones, and then do the content-comparison using the fast 100%-exact-match case and detect the renames.

Having made the intermediate commit, you can then make changes—even massive changes—to the renamed files, and make another commit. When Git compares the parent and child commits, all the files have the same names, even if the contents of some have changed massively, and Git can then give you a file-by-file diff for the paired-up files that did not change names. (See footnote 1 again.)
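The two-commit approach can be sketched in a throwaway repository. This is only an illustration: the file names old.txt and new.txt and the generated contents are made up for the demo, and the script assumes a POSIX shell with git, mktemp, and seq available.

```shell
set -eu

# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit 1: the file under its old name.
for i in $(seq 1 20); do printf 'original line %02d\n' "$i"; done > old.txt
git add old.txt
git commit -q -m "add old.txt"

# Commit 2: rename only, contents untouched.
git mv old.txt new.txt
git commit -q -m "rename old.txt to new.txt"

# Commit 3: rewrite the contents under the new name.
for i in $(seq 1 20); do printf 'replaced line %02d\n' "$i"; done > new.txt
git commit -q -a -m "rewrite new.txt"

# Parent-to-child comparisons: the files pair up at each step.
rename_step=$(git diff --find-renames --name-status HEAD~2 HEAD~1)
edit_step=$(git diff --find-renames --name-status HEAD~1 HEAD)

# First-to-last comparison: no shared content is left to match on.
end_to_end=$(git diff --find-renames --name-status HEAD~2 HEAD)

printf 'rename step:\n%s\n\n' "$rename_step"
printf 'edit step:\n%s\n\n' "$edit_step"
printf 'end to end:\n%s\n' "$end_to_end"
```

The rename step should report an exact rename and the edit step a plain modification of new.txt, while the end-to-end comparison falls back to one deleted and one added file.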

This won't help when you compare the first snapshot, pre-rename, to the last one, post-rename and post-massive-change. It will only help when you go pre-rename to post-rename, and then as a second step, post-rename to post-massive-change; or equivalently, one commit at a time backwards, as Git usually does. So it won't help much with a later git merge.

For cases where this is not suitable—including at git merge time, when git merge runs git diff --find-renames on the base and tip commits without ever looking at any of the commits in between—you can lower the minimum similarity. What we were doing above, by making two commits, was taking advantage of the fast and easy case: Given two files with different names, but 100% identical contents, Git pairs them up easily. But given two files with different names and only, say, 90% similar contents, Git can still pair them up. It just takes more work.

The more you change the contents of the renamed files, the harder it is to say that the two files are similar. But Git will try anyway: it will try every possible pairing.³ The best match, whatever that is, is the one taken, as long as it meets or exceeds the minimum match you specified. That minimum defaults to 50%.

To choose something other than the default, use, e.g., git diff --find-renames=30 for 30%, and git merge -X find-renames=30 to use the same reduced limit during merges. How can you tell what percentage to use? The answer is really just to try it out: the similarity index computation is a bit weird, so you have to experiment to see what works for your cases. If you have two commit hash IDs, you can run git diff --find-renames=25 --name-status --diff-filter=R <hash1> <hash2> to see what got paired up at 25%, and repeat with 75 or whatever other number you like if there are too many or too few pairings.
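Here is a minimal sketch of the staged rename-plus-edit situation from the question, reproduced in a scratch repository. The file names and contents are invented for the demo. The new file keeps 8 of the original 20 equal-length lines, so the similarity should land near 40%: too low for the 50% default, high enough for 30%. The thresholds are written with an explicit % sign because, per the git diff documentation, a bare number is read as a decimal fraction (so -M5 means 50%, not 5%).

```shell
set -eu

# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

for i in $(seq 1 20); do printf 'original line %02d\n' "$i"; done > old.txt
git add old.txt
git commit -q -m "add old.txt"

# Rename and heavily edit in one step: keep 8 lines, replace the other 12.
git mv old.txt new.txt
{
  for i in $(seq 1 8);  do printf 'original line %02d\n' "$i"; done
  for i in $(seq 9 20); do printf 'replaced line %02d\n' "$i"; done
} > new.txt
git add new.txt

# HEAD vs index at the default threshold and at a lowered one.
at_default=$(git diff --cached --name-status --find-renames=50%)
at_lowered=$(git diff --cached --name-status --find-renames=30%)

printf 'at 50%%:\n%s\n\n' "$at_default"
printf 'at 30%%:\n%s\n\n' "$at_lowered"

# With the pair detected, the same option gives a line-by-line diff
# between the old and new names instead of "all deleted / all added".
git diff --cached --find-renames=30%
```

For the files in the question, the equivalent would be git diff --staged with a suitably low --find-renames value, found by experiment.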

When you run git status, that runs two git diffs, each of two trees:

  • HEAD vs the index
  • the index vs the work-tree

Both comparisons have rename detection turned on and set to 50%. There is no option to change this.

Neither the index nor the work-tree is an actual commit, so you can't quite hand them to git diff, but git diff itself can do the same comparisons, and here you can use the options:

    git diff --cached --name-status --find-renames=...  # for HEAD vs index
    git diff --name-status --find-renames=...           # for index vs work-tree

Add --diff-filter=R to show only the detected renames, if that's what you care about.

Note that --find-renames is on by default since Git 2.9, and off by default in earlier Git releases. Using --find-renames turns detection on at 50%, or at the number you supply. The configuration setting diff.renames can be set to true, false, or copies or copy. Only the porcelain diff commands (such as git diff, git log, and git show) use the configured diff.renames—the plumbing commands are unaffected by user settings. (This is a big part of what makes them "plumbing commands".)
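The porcelain/plumbing difference can be checked with a pure rename in a scratch repository. This is a sketch with made-up file names; git -c is used to override diff.renames for one command at a time rather than changing any real configuration.

```shell
set -eu

# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

for i in $(seq 1 20); do printf 'original line %02d\n' "$i"; done > old.txt
git add old.txt
git commit -q -m "add old.txt"
git mv old.txt new.txt
git commit -q -m "rename old.txt to new.txt"

# Porcelain: git diff obeys diff.renames.
porcelain_off=$(git -c diff.renames=false diff --name-status HEAD~1 HEAD)
porcelain_on=$(git -c diff.renames=true diff --name-status HEAD~1 HEAD)

# Plumbing: git diff-tree ignores diff.renames and needs an explicit -M.
plumbing_cfg=$(git -c diff.renames=true diff-tree -r --name-status HEAD~1 HEAD)
plumbing_m=$(git diff-tree -r -M --name-status HEAD~1 HEAD)

printf 'git diff, diff.renames=false:\n%s\n\n' "$porcelain_off"
printf 'git diff, diff.renames=true:\n%s\n\n' "$porcelain_on"
printf 'git diff-tree, diff.renames=true:\n%s\n\n' "$plumbing_cfg"
printf 'git diff-tree -M:\n%s\n' "$plumbing_m"
```

Only the second and fourth commands should report a rename; the other two list a deletion and an addition.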


¹ When using git diff, you can tell Git to break a pairing. That is, if you have two files with the same name, but radically different content, you can tell Git: Before doing rename detection, break up pairs of files whose content is too different. Put the broken-up pair into the rename-detection pool. This option is not available in git merge, only in git diff.

² Git stores each content by hash ID, so detecting that file with name X in commit A is 100% identical to file with name Y in commit B is just a matter of looking at the hash IDs. If the hash IDs match, the files match too. Having found these 100% identical content matches, Git has now paired A:X with B:Y and the two names are no longer in the "files to be paired" pool.

Note that while this is fast and easy and guaranteed to work, if there is also a B:Z that's 100% identical to A:X, there's no telling whether A:X is going to be matched with B:Y or B:Z. Here, instead of—or in addition to—rename detection, you may want to enable Git's copy detection, so that Git can say that A:X got copied to both B:Y and B:Z. The details here, of the interactions, get a bit complicated.
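The ambiguous case can be reproduced with one file that disappears and two identical files that appear. This is a sketch with made-up names x.txt, y.txt, and z.txt; it compares plain rename detection (-M) with copy detection (-C).

```shell
set -eu

# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

for i in $(seq 1 20); do printf 'original line %02d\n' "$i"; done > x.txt
git add x.txt
git commit -q -m "add x.txt"

# x.txt disappears; y.txt and z.txt both carry its exact content.
cp x.txt y.txt
git mv x.txt z.txt
git add y.txt
git commit -q -m "replace x.txt with y.txt and z.txt"

# Rename detection alone can pair x.txt with only one of the two.
renames_only=$(git diff -M --name-status HEAD~1 HEAD)

# Copy detection lets x.txt serve as the source for both.
with_copies=$(git diff -C --name-status HEAD~1 HEAD)

printf 'with -M:\n%s\n\n' "$renames_only"
printf 'with -C:\n%s\n' "$with_copies"
```

With -M, one of the new files should show as a rename and the other as a plain addition; which is which is not something to depend on. With -C, one should show as a copy and the other as a rename.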

³ In fact, there's a limit to how many pairings Git will try. The rename detection code has two file-name queues: unmatched on left and unmatched on right. The similarity computation must compare every left and right entry, which is len(left) * len(right) file comparisons. If both lengths are N, this is N² comparisons, which is very expensive computationally. Git therefore has a setting called renameLimit, which limits the lengths of the queues. This limit was 100 originally, then increased to 200 in Git 1.5.6, and then to 400 in Git 1.7.4.2 / 1.7.5, but you can set it to "unlimited" by configuring the limit to 0, if you like (though Git will still limit it internally to 32767).

There is a separately-configurable merge rename queue length limit, currently defaulting to 1000. If you set diff.renameLimit but do not set merge.renameLimit, both use the diff.renameLimit value.
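Setting these limits looks roughly like this. The numbers 2000 and 5000 are arbitrary example values, not recommendations, and the repository is a scratch one created just for the demo.

```shell
set -eu

# Throwaway repository; names and values below are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Per-repository limits for diff-time and merge-time rename detection.
git config diff.renameLimit 2000
git config merge.renameLimit 5000

diff_limit=$(git config diff.renameLimit)
merge_limit=$(git config merge.renameLimit)
printf 'diff.renameLimit=%s merge.renameLimit=%s\n' "$diff_limit" "$merge_limit"

# A one-off override for a single command, without touching the config.
for i in $(seq 1 20); do printf 'original line %02d\n' "$i"; done > old.txt
git add old.txt
git commit -q -m "add old.txt"
git mv old.txt new.txt
git commit -q -m "rename old.txt to new.txt"
one_off=$(git -c diff.renameLimit=0 diff --find-renames --name-status HEAD~1 HEAD)
printf '%s\n' "$one_off"
```

Add --global to the git config commands to apply the limits to every repository for the current user.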

Upvotes: 4
