Reputation: 281

Know which files were created, modified or deleted in a commit

I am currently trying to display information about a specific commit in my application.

I want to know which files were created, modified or deleted in this commit, but if I use git show, the only information I get is the number of lines added or deleted.

Upvotes: 3

Answers (2)

torek

Reputation: 487785

TL;DR

Use -m --first-parent along with --name-status to get what you want across merge commits. Note that --first-parent also changes the way git log walks the graph, which matters if you use these options with git log -p rather than git show.

Long

You mention git show directly:

... if I use git show, the information I get will be the number of lines deleted or added.

It's worth pointing out here how git show produces a diff listing. This starts with a review of what a commit is.

As the Pro Git book says, each commit acts as a snapshot of all of your source files. In other words, commits don't say "make these changes to some files you already have." Instead, commits say "if you want this commit, here are the files, all of them, intact. Extract and go!"

The problem with storing deltas, or change-sets, is obvious. Suppose all I tell you is "change file main.py by adding these three lines in the middle." You don't even have main.py yet. How are you going to add three lines in the middle?

The problems with storing entire files intact are also obvious, of course:

One objection is that the repository will quickly grow hugely fat and impossible to use: if I make 1000 commits and each commit has a 100-K-byte (approximately) file, I've put 100 megabytes of copies of that file into the repository.

But that's just silly, because my 1000 commits probably have at least 300 copies of that file that are all the same. The next 300 are probably also all the same, and so on—maybe there are only four versions of the big file. And every commit, once made, is permanent (mostly—it's sometimes possible to delete some commits entirely) and read-only (entirely—no commit can ever be changed; at best you make a replacement, and delete the bad one entirely).

I literally can't change the copy of the file I put in, so if 300 commits all use that version of the file, they can just share that version of the file. This means that my 1000 commits have only four copies of the 100 KB file, using 400 KB, not 100 MB, for a factor of 250 in compression.
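
The sharing described above falls out of naming content by its checksum. Here is a minimal Python sketch, assuming a toy in-memory store: the ObjectStore class and its put_blob method are made up for illustration. Only the "blob <size>" header plus SHA-1 mirrors how Git names blob objects in SHA-1 repositories.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store (hypothetical, not Git's real code)."""

    def __init__(self):
        self.objects = {}  # hash ID -> raw content

    def put_blob(self, content: bytes) -> str:
        # Git hashes a short header plus the content to name a blob.
        header = b"blob %d\x00" % len(content)
        oid = hashlib.sha1(header + content).hexdigest()
        # Identical content yields the same ID, so it is stored only once.
        self.objects.setdefault(oid, content)
        return oid


store = ObjectStore()
# Four distinct versions of a 100,000-byte file.
versions = [bytes([n]) * 100_000 for n in range(4)]
# 1000 commits, each recording whichever version was current at the time.
commits = [{"big-file.dat": store.put_blob(versions[i * 4 // 1000])}
           for i in range(1000)]

stored_bytes = sum(len(c) for c in store.objects.values())
print(len(commits), "commits,", len(store.objects), "blobs,", stored_bytes, "bytes")
```

Every commit still lists big-file.dat, but the store holds only four distinct contents, which is where the factor of 250 comes from.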

Git has additional behind-the-scenes tricks to compress this even further. In general, Git adds zlib deflate compression to everything, and in particular, Git also sneaks delta encoding in, invisibly, during what Git calls its garbage collection process. So each commit has a complete copy of the file at a logical level, but (a) it's compressed and (b) somewhere deep in the bowels of Git, the file might be internally delta-compressed against other copies of the file. But you don't need to know any of this to use Git: at the "I have a commit" or "I don't have a commit" level, you either do have the commit, in which case you do have all of its files, or you don't have the commit, and you can't even ask whether you have its files yet.

The other objection is more serious, because it's a problem with actually getting work done. Specifically, if a commit is a snapshot, how are we going to handle things like code reviews and working out where some bug was introduced or fixed? How can we take a fix we made to one version of the program, and apply it to another different version?

From snapshot to delta / changeset

If you're familiar with the tools that existed before Git and many other source code management systems, you know of the rather ancient Unix diff command. This command is at the very least the inspiration for, and perhaps even a direct ancestor of, git diff. Using git diff, we can compare any two commits, and have Git tell us what changed from commit A to commit H for instance.

In essence, if we tell Git:

git diff hash1 hash2

Git just extracts the commit identified by hash1, and then the commit identified by hash2, and then diffs them. Voila, we know what changed between A (hash1) and H (hash2)!
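
To make "extract both snapshots and compare them" concrete, here is a rough Python sketch. It assumes each snapshot is just a mapping from path to a content ID; the name_status function and the sample data are hypothetical. Real Git can also detect renames and copies, which this sketch ignores.

```python
def name_status(old: dict, new: dict) -> list:
    """Compare two snapshots, each mapping path -> content ID."""
    result = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            result.append(("A", path))   # present only in the new snapshot
        elif path not in new:
            result.append(("D", path))   # present only in the old snapshot
        elif old[path] != new[path]:
            result.append(("M", path))   # present in both, content differs
    return result


# Made-up snapshots standing in for commits A and H.
snapshot_a = {"README.md": "id1", "main.py": "id2", "old.txt": "id3"}
snapshot_h = {"README.md": "id1", "main.py": "id9", "new.txt": "id4"}
for status, path in name_status(snapshot_a, snapshot_h):
    print(status, path)
```

Because the comparison needs nothing but the two snapshots, it works for any pair of commits, whether or not they are related.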

But wait: every commit, in Git, not only stores a snapshot of its files, it also stores the hash ID of its parent commit. Each commit's hash ID is a big ugly string of letters and digits that uniquely identifies that one particular commit. No other commit can ever have this same hash ID. Every other commit gets a different hash ID. The hash IDs are actually cryptographic checksums of the contents of the commits, which is why we can't change anything we've committed: Git uses this "save a cryptographic checksum" technique to uniquely identify everything that can be identified uniquely like this.¹
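
A small Python sketch of why a checksum-based ID makes commits unchangeable. This is a deliberately simplified model: the commit_id function and its text layout are assumptions for illustration, and a real commit object also records author, committer and timestamps.

```python
import hashlib


def commit_id(snapshot_id: str, parent_id, message: str) -> str:
    """Simplified stand-in for a Git commit hash (not the real object format)."""
    lines = ["tree " + snapshot_id]
    if parent_id is not None:
        lines.append("parent " + parent_id)
    lines.append("")
    lines.append(message)
    return hashlib.sha1("\n".join(lines).encode()).hexdigest()


a = commit_id("snap-1", None, "first")
b = commit_id("snap-2", a, "second")
# Changing the message or the parent produces a different ID,
# i.e. a different commit rather than a modified one.
b_reworded = commit_id("snap-2", a, "second, reworded")
b_reparented = commit_id("snap-2", "some-other-parent", "second")
print(len({a, b, b_reworded, b_reparented}))
```

Since the parent's ID is part of what gets hashed, replacing one commit forces new IDs for every commit built on top of it.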

What this means in practice is that commits in a Git repository form a sort of chain, with each new commit remembering—or pointing to—its immediate predecessor commit. We can start at the end of this chain and work backwards, so that in a small repository with just a few commits, we might have something like this:

A <-B <-C

Commit C has some hash ID. Commit C stores a snapshot of all the files. And, commit C stores the hash ID of commit B. So if we know the hash ID of C, we can look it up in Git's giant database of "all commits / objects in this repository"—not that giant yet, there are only three commits—and use that to find B's hash ID, which we can look up in Git's database to find A.

What all of this means is that we just need to somehow remember the hash ID of the last commit in the chain. From that last commit, we can work backwards, all the way through the repository to the very first commit ever. Without getting into detail, let me just say that it is the branch name that holds the hash ID of commit C—so that we can finish out the drawing this way:

A--B--C   <-- master

The name master lets us find commit C, which lets us find B, which lets us find A. Commit A has no parent—Git calls that a root commit—which lets us know that the chain ends and we're done.
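
The backwards walk can be sketched in a few lines of Python. The parents and branches dictionaries below are a hypothetical stand-in for Git's object database and its branch names, using single letters in place of real hash IDs.

```python
# Hypothetical three-commit repository: each commit records its parents.
parents = {"A": [], "B": ["A"], "C": ["B"]}
branches = {"master": "C"}


def history(branch: str) -> list:
    """Follow parent links from the branch tip back to the root commit."""
    chain = []
    current = branches[branch]
    while current is not None:
        chain.append(current)
        # A root commit has no parent, which ends the walk.
        current = parents[current][0] if parents[current] else None
    return chain


print(history("master"))
```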

All of this is a fairly long-winded way of getting to the point that git show can show us what we changed in commit C. It does that by looking at the stored parent hash. The parent of C is B. So to show what we did in C, Git does:

git diff <hash-of-B> <hash-of-C>

We already know that this essentially extracts the two commits and compares them. It's now obvious that this compares the snapshot in B to the snapshot in C—and that, by definition, is what we changed.

¹This includes file snapshots—that's how Git manages to store only four copies of the 100 KB file. The file gets reduced to a checksum, and the checksum is the name of the content-version, as stored in the Git database. These content-versions are stored as what Git calls blob objects. The file system level name of the file, such as big-file.dat, gets stored in a separate object that Git calls a tree object.

In essence, the heart of the Git repository is a collection of objects, stored as a key-value database. The keys are hash IDs and the values are the underlying commit, tree, blob, or a fourth type of object that Git calls an annotated tag object. You don't need to know this either, to use Git. You just need to know that the commits have hash IDs and that these hash IDs form a sort of complicated chain. But it may help to get a full mental picture of what's going on.

Why this doesn't work for merges

Again, without getting too detailed, let's look at a branch-and-merge situation. Here our graph gets a little more complicated, but maybe not too complicated. We'll start with some commit that two branches have in common, and call its hash ID H:

...--H    <-- common-starting-point

Then we'll make two new branches and make one commit on each branch, so that there are now two new commits I and J with new names pointing to them:

       I   <-- branch1
      /
...--H    <-- common-starting-point
      \
       J   <-- branch2

From here we'll make two more commits, K and L (and stop drawing in the name common-starting-point), just for prettiness and/or so that I can call the merge commit M, like this: :-)

       I--K   <-- branch1
      /
...--H
      \
       J--L   <-- branch2

We now make a merge commit M using, e.g., git checkout branch1 && git merge branch2, which gives us this result:

       I--K
      /    \
...--H      M   <-- branch1
      \    /
       J--L   <-- branch2

Note that the name branch1 points to our new commit M. Commit M stores a snapshot of all the files, just like any other commit. It does have something special about it though.

The usual rule for adding new commits is that the new commit points back to its immediate parent. For M that would be K—the commit that the name branch1 pointed to just before we ran git merge. So M stores the hash ID of commit K. But what makes M a merge commit is that M stores a second parent too. We told Git to merge commits K and L, so M has K as its first parent, but then has a second parent L.

(The fact that we used git merge to make M, and that git merge went back to commit H in order to make M, is not stored anywhere. I would argue that it should be, or at the least that something about this should be stored in the commit, because there are ways to run git merge that modify its action, e.g., using -X ours or -X find-renames=<number>. But Git doesn't store this now, and since no existing commit can ever be changed, we have to be able to get along without that information. For the most part, we can.)

In any case, after we've made the merge, we have this commit M, which has a slight bit of special-ness because it has two parents instead of the usual one. We call this a merge commit, which uses the word merge as an adjective modifying commit. Or, sometimes, we just call it a merge, using the word merge as a noun. This is why I make a big distinction between the verb form, to merge, meaning to invoke Git's merge machinery—e.g., by running git merge—and the noun form, a merge. A merge is a thing, and to merge is an action that often produces a merge.

So, back to git show: let's have git show show commit M. The usual way that git show shows a commit—or rather, shows what we did in a commit—is to do:

git diff <hash-of-parent> <hash-of-commit>

But commit M doesn't have a parent. Commit M has two parents. Which one should git show give to git diff?

`git log -p` and `git show`

Let's take a quick side trip here. The git log command has -p to show each commit as a patch. That is, git log -p is like repeatedly running git show: it shows a commit's log message, then turns that snapshot into a patch. That's exactly what git show does. Then git log goes to the commit's parent, and shows the commit's message and a patch; then it goes to the parent's parent, and so on. In other words, given a nice straight line of commits H then G then F then ..., it walks backwards along that straight line, showing H then G then F and so on.

When git log gets to a merge commit like M, it has two problems:

How do you show a merge as a patch? That's hard, and git log answers this question with the simple answer: "I don't."

In other words, git log -p simply doesn't bother to show a patch. That's its default answer, anyway.

Given that merge M has two parents, which parent do you show next? That's hard too, but git log answers it by saying: "I show both." Of course it has to pick one to go before the other, and things can get tricky here. Since we're not concerned with git log right now, we'll ignore this part.

The git show command is not quite as lazy as the git log command. It's not going to have to keep logging across both parents, so it's willing to work harder at the "show M as a patch" problem. But what it does is a little weird.

Commit M is a merge, probably made by running git merge. If the merge went well (that is, there were no merge conflicts), then Git made all the decisions about how to do the merge. So in this case git show, by default, shows the log message but no diff. But if there were merge conflicts, whoever did the merge had to resolve them. In this case git show shows where the merge conflicts occurred.

In this case, Git builds what Git calls a combined diff. We take merge M and compare it to parent #1, i.e., commit K, by doing the usual one-pair-of-commits diff. Some files are changed in this diff and some aren't. Then we take merge M and compare it to parent #2, i.e., commit L. Some files are changed in this diff and some aren't. So now we have two lists of changed files:

 M-vs-K       M-vs-L
--------     --------
README.md
main.py      main.py
             stuff.py

There's only one file changed in both diffs, so next, Git throws away the README.md and stuff.py diff listings. It's now ready to combine the diff listings for main.py.
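
The file-selection step can be modeled as a set intersection. This Python sketch covers only that step, not the later combining of hunks. The snapshots are made-up data chosen to reproduce the table above, and changed_paths is a hypothetical helper.

```python
def changed_paths(parent: dict, merge: dict) -> set:
    """Paths whose content differs, or that exist on one side only."""
    return {p for p in parent.keys() | merge.keys()
            if parent.get(p) != merge.get(p)}


# Hypothetical snapshots (path -> content ID) for K, L and the merge M.
K = {"README.md": "r1", "main.py": "m1", "stuff.py": "s2"}
L = {"README.md": "r2", "main.py": "m2", "stuff.py": "s1"}
M = {"README.md": "r2", "main.py": "m3", "stuff.py": "s2"}

m_vs_k = changed_paths(K, M)
m_vs_l = changed_paths(L, M)
# Only files that differ from every parent survive into the combined diff.
combined_candidates = sorted(m_vs_k & m_vs_l)
print(sorted(m_vs_k), sorted(m_vs_l), combined_candidates)
```

Note that README.md and stuff.py really did change relative to one parent each, yet neither appears in the result.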

What this combining step does is a little difficult to describe (and not documented). Using -c produces a non-dense result and using --cc produces a dense result (unless a rename detection queue overflow occurs, in which case Git falls back on -c and produces a warning). Note that we've already thrown out two of the three files—that doesn't change, regardless of dense/non-dense here—but now, in the default dense or --cc mode, Git throws out some of the diff hunks as well!

In essence, what git diff --cc does here is to attempt to show only those areas where manual merging was required. Of course, if you used -X ours or -X theirs, manual merging wasn't actually required—Git just took the "ours" or "theirs" side instead—but git diff --cc will still show that diff hunk.

In non-dense mode, git diff -c may show additional diff hunks, though the code for this is a little squirrelly and I am not sure I read it correctly in my quick scan. If you want to examine it yourself, you can find this code in combine-diff.c.

The key takeaway here, though—the part that is documented, and matters for the original question—is this: Combined diff ignores a lot of actual differences, on purpose, to try to show you only something relevant. That makes a bold, and often unwarranted, assumption about what you consider relevant. Be careful with combined diff.

Note that combined diffs do not occur when you give git diff two commits to compare. You get a combined diff by running a command that automatically picks out the parent hash IDs. When it hits a merge, it automatically picks all of the parents, and—zap—you get a combined diff.

What does work for merges

Let's revisit the graph for a moment:

       I--K
      /    \
...--H      M   <-- branch1
      \    /
       J--L   <-- branch2

Most of Git's show-a-commit-as-a-patch commands diff the parent of the commit against the commit. But commit M is a merge, with two parents, so these commands either show nothing at all, or show a combined diff. If that's not what you want, you need to take control.

Hence, if you have some commit name or hash ID, such as M (hash ID) or branch1 (name), and you want to see what changed between the first parent of M and M itself, you can do this:

git diff M^ M

or:

git diff branch1^ branch1

Here we're using the hat-suffix operator to say "go to the first parent." (We can also use ~1, which means "go back first-parent one time." The tilde suffix is meant for cases where you want to go back multiple first-parents: you can write branch1~2 to go from M to K and then to I, for instance. For those stuck with a shell that requires typing ^^ instead of just ^, which I understand is an issue on some DOS/Windows systems, you can use ~ always, as branch1~ means branch1~1, which means the same thing as branch1^.)
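
Here is a simplified Python model of how these suffixes walk the graph drawn above. The resolve function is hypothetical and handles only plain names followed by ^ and ~ suffixes. It also accepts ^2 for the second parent, which Git supports but which is not discussed above.

```python
import re

# Graph from the drawing above; the first listed parent is the first parent.
parents = {"H": [], "I": ["H"], "J": ["H"], "K": ["I"], "L": ["J"],
           "M": ["K", "L"]}
names = {"branch1": "M", "branch2": "L"}


def resolve(spec: str) -> str:
    """Resolve a name followed by ^, ^N, ~ or ~N suffixes (simplified)."""
    match = re.fullmatch(r"([\w-]+)((?:[\^~]\d*)*)", spec)
    if match is None:
        raise ValueError("unsupported revision: " + spec)
    # A branch name maps to its tip; anything else is taken as a commit ID.
    commit = names.get(match.group(1), match.group(1))
    for op, digits in re.findall(r"([\^~])(\d*)", match.group(2)):
        count = int(digits) if digits else 1
        if op == "~":
            # ~N: step back through first parents N times.
            for _ in range(count):
                commit = parents[commit][0]
        elif count > 0:
            # ^N: pick the Nth parent once (^ alone means ^1).
            commit = parents[commit][count - 1]
    return commit


for spec in ["branch1^", "branch1~", "branch1~2", "branch1^2"]:
    print(spec, "->", resolve(spec))
```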

Both git log and git show—which share a lot of their code; in particular they share all the code that calls git diff for you—have two interesting options:

-m "splits" a merge commit (m stands for merge).

As we've seen repeatedly here, a merge like M has two parents. Using the -m option tells the internal diff code to "split" the merge into two virtual commits. Instead of:
```
       I--K
      /    \
...--H      M
      \    /
       J--L
```
the diff code treats this as:
```
       I--K--M1
      /
...--H
      \
       J--L--M2
```
just for the purpose of the diffing. The two virtual commits M1 and M2 use the snapshot of M but have a different "name". Having been split like this, they now have one parent each, and git show or git log can run git diff twice. The first git diff sees this as K vs M1 and produces one diff, and the second git diff sees this as L vs M2 and produces one diff.

You now have two diffs, one for each parent. (If M is an octopus merge, with three or more parents, you get three or more diffs, one for each parent.)

--first-parent tells git log or git show to look only at the first parent of each merge. Since git show doesn't walk the graph, this has no real effect on it unless you include -m to split the merge when diffing. With git log, it tells Git to walk from the merge, back through only its first parent, and adding -m affects the diff listing too, if you use -p to produce one.
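
As a rough model of what the two options do to the diff step, the Python sketch below produces one name-status listing per parent, or only the first parent's listing. The function names, the first_parent argument and the snapshot data are all illustrative assumptions, not Git internals.

```python
def name_status(old: dict, new: dict) -> list:
    """Compare two snapshots, each mapping path -> content ID."""
    result = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            result.append(("A", path))
        elif path not in new:
            result.append(("D", path))
        elif old[path] != new[path]:
            result.append(("M", path))
    return result


# Hypothetical snapshots for the merge M and its parents K and L.
snapshots = {
    "K": {"README.md": "r1", "main.py": "m1", "stuff.py": "s2"},
    "L": {"README.md": "r2", "main.py": "m2", "stuff.py": "s1"},
    "M": {"README.md": "r2", "main.py": "m3", "stuff.py": "s2"},
}
parents = {"M": ["K", "L"]}


def split_diffs(commit: str, first_parent: bool = False) -> dict:
    """One diff per parent (the -m idea); optionally only the first parent."""
    chosen = parents[commit][:1] if first_parent else parents[commit]
    return {p: name_status(snapshots[p], snapshots[commit]) for p in chosen}


print(split_diffs("M"))                     # like -m: K vs M, then L vs M
print(split_diffs("M", first_parent=True))  # like -m --first-parent: K vs M
```

With only the first parent selected, the listing answers "what did merging branch2 bring into branch1?", including files a combined diff would have dropped.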

This gives us what works for merges:

A manual git diff, given two commit hash IDs, compares the two snapshots. There are no issues with combined diffs because we didn't have Git automatically choose the parent, so Git never has a chance to pick all the parents of a merge.
Or, using -m --first-parent causes git log -p or git show to split the merge into two virtual commits, then use only the first-parent one when running the internal git diff to show the patch.

If you are using git log or git show with the --name-status option to show only file names and the status of each file (A for added, D for deleted, M for modified, and so on), adding -m --first-parent defeats the combined-diff code, which assumes you wanted to know where the merge conflicts were and therefore produces the wrong answer on a merge. A manual git diff of two commits with --name-status never uses a combined diff in the first place.

Upvotes: 1

Jamie Bisotti

Reputation: 2675

I believe adding the --name-status option will show you what you want.

Git docs: git-show

Upvotes: 5