mugetsu

Reputation: 4398

git diff between two branches not matching the changes during a merge

I have two branches, master and feature. If I do:

git diff --name-only master..feature

I get a long list of files. Some of them are source code, so they are not excluded by .gitignore.

But, when I try to merge feature into master:

git checkout master
git merge feature

I get only a single file changed in master during the merge process.

Why does this happen?

Another interesting thing is, if I try the reverse and merge master into feature, files that were created in the feature branch are deleted.

How do I fix this and avoid this issue in the future?

Upvotes: 3

Views: 4003

Answers (1)

torek

Reputation: 489223

That's not a bug.

Consider the following simple example. Suppose there is a file named example.txt. In branch X, it reads:

This is
quite
a file.

In branch Y, it reads:

This is
not
a file.

What should the result of merging branches X and Y be? Specifically, what content do you expect to appear in the file named example.txt?

What information, if any, have I failed to give you? What else do you need to know before you can even answer this question?

(Try to come up with an answer before you read on.)

Git is about commits, not files

Before we go on, let's note that the unit of storage you deal with, in Git, is the commit, not the file. It's true that commits contain files, but the general idea here is that it's a package deal: a commit has a full snapshot of all of the files. If we take some starting commit:

git checkout somebranch

and split a big file, bigfile.py, into two smaller files, small1.py and small2.py and remove bigfile.py entirely and then commit, the new commit lacks bigfile.py and adds the two smaller files, as compared with the old commit. When we check out the old commit, we have just one of the three files—the big one—and when we check out the new commit, we have just two of the three files. It's a package deal: you can pick the commit with one file, or the commit with two, but you never get both the big file and one of the small ones, or all three files, or some other combination.

Still, commits contain files, and that will be important later when we get around to merging. The files are a commit's main data: a snapshot of every file, as it appeared when you made that commit. But besides containing files, each commit contains some metadata, or information about the commit. This includes the stuff you see in git log output: the name and email address of the person who made the commit, and a date-and-time stamp, for instance. [1]

In amongst all this metadata, Git stores, in each commit, the raw hash ID of some earlier commit(s). Most commits store exactly one earlier commit hash ID. These hash IDs are the "true names" of the commits, too: they're how Git actually finds each commit. The commits are stored in a big key-value database, with the hash ID of the commit being the key, and the commit's content being the value.
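To see this stored parent hash ID for yourself, here is a small sketch. It assumes a throwaway repository in a temporary directory; the file name, commit messages, and the demo identity are all made up for illustration.

```shell
# Throwaway repository: every name below is hypothetical.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Two commits, so that the second one has an earlier commit to point to.
echo one > file.txt && git add file.txt && git commit -q -m "first"
echo two > file.txt && git commit -q -am "second"

first=$(git rev-parse HEAD~1)    # hash ID of the earlier commit
second=$(git rev-parse HEAD)     # hash ID of the latest commit

# Print the raw commit object: a tree line (the snapshot), a parent line,
# the author and committer lines, then the log message.
git cat-file -p "$second"

# Pull out just the stored parent hash ID.
parent=$(git cat-file -p "$second" | sed -n 's/^parent //p')
echo "parent stored in second commit: $parent"
```

The parent line inside the second commit should be exactly the hash ID of the first commit, which is the backwards-pointing link described above.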

With each commit storing the previous commit's hash ID, we end up with a nice simple linear chain of commits. If we use uppercase letters to stand in for each hash ID, we get drawings that look like this:

... <-F <-G <-H

where H is the hash ID of the last commit in the chain. Inside commit H, Git has stored the actual hash ID of earlier commit G. Inside commit G, Git stored the hash ID of still-earlier commit F, and so on.

These chains allow Git to work backwards, from the latest commits back to earlier ones. These are the history in a Git repository, so these chains are crucial to using Git. And, since each commit stores a full snapshot, we have to have Git compare two commits to see what changed. If we have Git compare the snapshot in G to the snapshot in H, for instance, that tells us what we changed when we made H from G.

So, this is what git log does: it starts at the latest commit (such as H), prints out the hash ID and the metadata, and if we used -p to get patches, extracts both G and H (to a temporary memory area) and compares the two commits' snapshots to figure out what changed, and shows us that. Then, having shown commit H, Git moves backwards one step to commit G: it prints out the hash ID and metadata, and if we used -p, compares F-vs-G. Having printed out G, git log moves back one more step to F, and so on down the line.

(In other words, Git works backwards. I won't emphasize this more here but it explains a lot about Git, once you realize this.)


[1] If you use git log --pretty=fuller, you'll see that each commit actually has two of these: an author and a committer. Each one is made up of a triplet: name, email, timestamp. Usually both are the same these days, except for cherry-picked commits, where the author of the original commit is retained, and the committer is the person who did the cherry-pick, with the committer time stamp being the time of the cherry-pick action.


Branch names merely help us find commits

To make the above work, we have to know—somehow—the hash ID of the last commit in the chain. We need to give that hash ID to Git, because Git can only find commits by their hash IDs, in the end. We could write down these hash IDs, jotting them on paper, or on a whiteboard, or something. But they're really big and ugly and hard to type in correctly. Plus, we have a computer. Why not have the computer remember the hash IDs for us? We could add a second database to our Git repository: it would hold names, like master or develop or feature, and with those names, remember the hash ID of the last (most recent, most useful, whatever) commit.

That's just what a branch name is: it's an entry in a names database. The actual name is extended a bit: master is really refs/heads/master and feature is really refs/heads/feature. This leaves room for other kinds of names, like tag names: v2.1 is really refs/tags/v2.1. But for branch names in particular, they all hold commit hash IDs—one each—and that hash ID is the ID of the last commit that we're going to consider to be "on the branch".

If we only have one branch, everything is easy:

...--F--G--H   <-- master

Here, the branch name master is the only name, and it holds the hash ID of our most recent commit, commit H. So the name master points to the commit at the end of the chain. That lets us (and Git) access commit H. Commit H points backwards to commit G, which lets us (and Git) access it; commit G points backwards again, and so on.

If we create a new branch name now, such as feature, we can pick any of the existing commits for this new name to point to. Most often, though, we'll pick the commit we're using: H, via master. So we'll get:

...--F--G--H   <-- feature, master

Now we have a problem. Which branch name are we using? To remember, we'll add a special name, HEAD, and attach it to one of these two branch names. Let's attach HEAD to feature—by running git checkout feature if necessary—and draw that:

...--F--G--H   <-- feature (HEAD), master

We're still using commit H, but now we're using it because of the name feature.

Now let's create a new commit, in the usual way: modify some files, maybe even create new ones and/or remove existing ones, and use git add and/or git rm as needed to get them all updated, and git commit the result. Without worrying too much about all the details, this has Git save away a new snapshot, add some metadata, and write out the collection as a new commit. The new commit gets a new, unique hash ID—something random-looking, and unpredictable since it depends on the exact time at which we make the commit—but we'll just call it commit I. The new commit will point backwards to the existing commit H:

             I
            /
...--F--G--H

Once the new commit exists, even before we get back to being able to run more commands, Git now does its last special trick: it writes the new commit's hash ID into the current branch name, i.e., the one HEAD is attached to. Since that's feature, we get:

             I   <-- feature (HEAD)
            /
...--F--G--H   <-- master

Commit H comes right before commit I, but it's still the last commit on the master branch. Commit I is the last commit on feature, but commits up through H are on feature too.
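The way a commit moves only the current branch name can be checked with a sketch like this one. It assumes a scratch repository; the file name and the messages "H" and "I" are just stand-ins for the letters in the drawings.

```shell
# Scratch repository: names and contents are hypothetical.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > file.txt && git add file.txt && git commit -q -m "H"

# New name, same commit: feature and master both point to H,
# and HEAD is now attached to feature.
git checkout -q -b feature
feature_before=$(git rev-parse refs/heads/feature)

# Make commit I while HEAD is attached to feature.
echo more >> file.txt && git commit -q -am "I"

master_tip=$(git rev-parse refs/heads/master)
feature_tip=$(git rev-parse refs/heads/feature)
feature_parent=$(git rev-parse feature~1)

git symbolic-ref HEAD                     # shows which branch name HEAD is attached to
echo "master:  $master_tip"
echo "feature: $feature_tip (parent $feature_parent)"
```

Only feature should have moved: master still holds the hash ID of H, and the parent of the new feature tip is that same commit.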

Let's go ahead and make one more commit on feature now:

             I--J   <-- feature (HEAD)
            /
...--F--G--H   <-- master

and then run git checkout master. This will take our HEAD away from feature and attach it to master instead. It will also update our work area so that we are using the contents of commit H, rather than the contents of commit J: all our files now match H, not J. Any updates we made and snapshotted into I and J are safely stored there, in I and J, but they're gone from our view now, as we have commit H out:

             I--J   <-- feature
            /
...--F--G--H   <-- master (HEAD)

We could now make another new branch name, say, feature2, and attach HEAD to that:

             I--J   <-- feature
            /
...--F--G--H   <-- feature2 (HEAD), master

and then make two new commits on feature2:

             I--J   <-- feature
            /
...--F--G--H   <-- master
            \
             K--L   <-- feature2 (HEAD)

Or, we could just go ahead and make these commits directly on master:

             I--J   <-- feature
            /
...--F--G--H
            \
             K--L   <-- master (HEAD)

As far as the graph itself goes—the set of commits with the backwards-pointing arrows between them (drawn here as lines because the arrow graphics available in text are poor)—it doesn't matter: we can't change any existing commits (ever), but we can always add new commits, and either way, we end up with this set of commits. It's just a question of which names find these commits. But Git allows us to create, destroy, or move branch names any time we like. The commits don't change; it's just that the names we use to find them might be different.

Merging

It's time to answer the question above: what's missing?

When we merge some commits in Git, this is all about combining work. The idea is that someone, in some series of commits (I-J perhaps), did some work, and someone—probably someone else—in some other series of commits (K-L) did some work. That gives us this:

          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2

Because of the nature of commits—they never change—we can tell, from this graph, that these two lines of work started from a common starting point, namely commit H. It's really easy to see, visually, that everything in J is descended from H, and the same is true for L. They also descended from G, but H is "better" because it's "closer" to the end-point commits.

Now, we already know that Git can compare two snapshots like G and H, or I and J. What if Git compares H directly with J? It can do that just as easily, and if we have Git do that, we'll find out what's different from H to J. That's the work someone did on the top line. So those are the changes in br1.

Similarly, if we have Git compare what's in H to what's in L, we will find out what work someone did on the bottom line. Whatever files are different, and whatever rules we use to change the contents of files in H to those in L, that's what someone did on br2.

This also tells us what's missing. In order to merge example.txt, we need not just the two end-point files—one says quite on line 2, for instance, and the other says not on line 2—but also the base copy of the file. The base copy of example.txt is the copy of the file in commit H. Commit H is the merge base of the two tip commits, and its copy of each file is how we figure out what changed.

If the base copy says:

This is
quite
a file.

then we know nothing changed in the one that still says quite, and one line changed in the one that says not.

If the base copy says:

This is
not
a file.

then we know nothing changed in the one that still says not, and one line changed in the one that says quite.

If the base copy has no line 2—if it reads, in its entirety:

This is
a file.

then we have a merge conflict, because both people made a change: both added a line 2, but they added different ones.
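You can try two of these cases on plain files with git merge-file, which does a three-way merge of one file given its base copy. This is only a sketch: the file names x.txt, y.txt, base1.txt, and base3.txt are made up, and a real git merge does considerably more than this per-file step.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1

# The two end-point copies of example.txt (hypothetical file names).
printf 'This is\nquite\na file.\n' > x.txt   # as in branch X
printf 'This is\nnot\na file.\n'   > y.txt   # as in branch Y

# First case: the base copy says "quite", so only Y changed line 2.
printf 'This is\nquite\na file.\n' > base1.txt
merged=$(git merge-file -p x.txt base1.txt y.txt)
echo "$merged"

# Third case: the base copy has no line 2, so both sides added one.
printf 'This is\na file.\n' > base3.txt
git merge-file -p x.txt base3.txt y.txt > conflicted.txt
conflicts=$?                                 # non-zero when there are conflicts
cat conflicted.txt
echo "exit status for the third case: $conflicts"
```

With the first base, the merged result takes Y's change and reads "not" on line 2. With the third base, the output contains conflict markers, because there is no way to pick between the two added lines automatically.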

What this means for your case

If the two branch tip commits—the one found by the name master, and the one found by the name feature—are different, that just tells us that they're different. The recipe that Git comes up with, that will change one commit to make it match another commit, just tells us how to change the one tip commit into the other tip commit.

If the merge base commit between these two branch-tip commits is some third commit [2], we need to know what's in that third commit, because that's how git merge will figure out what changed in master and what changed in feature. The merge command will then attempt to combine those two sets of changes, applying the combined changes to whatever is in the merge base.

As phd commented, you can use the triple-dot notation with the git diff command:

git diff master...feature

for instance. This has Git:

  • find the merge base between the two tip commits (let's call that $B); then
  • run the equivalent of git diff $B feature

which tells you what changed on feature, with respect to this merge base. If you then run the same command with the two names swapped around:

git diff feature...master

Git will find the merge base of the same two tip commits [3], and then diff $B vs master: this shows you what changed on master.
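Here is a sketch of why the two-dot diff in the question lists more files than the merge touches. It assumes a scratch repository in which each branch edits a different file; a.txt, b.txt, and the commit messages are invented for the demonstration.

```shell
# Scratch repository: names and contents are hypothetical.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# The merge base: both files exist here.
echo a > a.txt && echo b > b.txt
git add a.txt b.txt && git commit -q -m "base"

# feature changes only a.txt.
git checkout -q -b feature
echo "feature change" >> a.txt && git commit -q -am "feature: edit a.txt"

# master changes only b.txt.
git checkout -q master
echo "master change" >> b.txt && git commit -q -am "master: edit b.txt"

two_dot=$(git diff --name-only master..feature)      # tip vs tip
three_dot=$(git diff --name-only master...feature)   # merge base vs feature
base=$(git merge-base master feature)
via_base=$(git diff --name-only "$base" feature)     # the same thing, spelled out

echo "two-dot:"   && echo "$two_dot"
echo "three-dot:" && echo "$three_dot"
```

The two-dot form lists both a.txt and b.txt, because the two tips differ in both files. The three-dot form lists only a.txt, which is the only file feature changed since the merge base, and so the only file a merge of feature into master would bring changes for.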

Again, what git merge does for these cases is: [4]

  • run both diffs, saving the output in a temporary area;
  • get the merge base version of every file;
  • combine the diffs, if possible; and
  • apply the combined diffs to the merge base versions of the files.

If this all goes well, git merge will make a merge commit from the result. A merge commit isn't very different from a regular non-merge commit: it still has a snapshot of all files—as built by the combining process above—and some metadata. The special thing about a merge commit is that it lists both branch-tip commits as its parents, so that Git can go back along both branches (which are now combined into one "branch" via the merge commit: this exposes a flaw in the word "branch"; see What exactly do we mean by "branch"?).
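To check that a merge commit really records both tips as parents, here is a sketch under the same kind of assumptions as before: a scratch repository, two branches that each edit a different (made-up) file, and a merge that is not a fast-forward.

```shell
# Scratch repository: names and contents are hypothetical.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo a > a.txt && echo b > b.txt
git add a.txt b.txt && git commit -q -m "base"

git checkout -q -b feature
echo "feature change" >> a.txt && git commit -q -am "feature: edit a.txt"

git checkout -q master
echo "master change" >> b.txt && git commit -q -am "master: edit b.txt"

old_master=$(git rev-parse master)
feature_tip=$(git rev-parse feature)

# Combine the two sets of changes; -m supplies the message so no editor opens.
git merge -q -m "merge feature" feature

parents=$(git cat-file -p HEAD | grep -c '^parent ')
first_parent=$(git rev-parse HEAD^1)
second_parent=$(git rev-parse HEAD^2)
echo "merge commit has $parents parents"
```

The merge commit's snapshot should contain both edits, its first parent is the previous master tip, and its second parent is the feature tip.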


[2] There are some degenerate cases here. In particular, if the merge base is one of the two branch tip commits, we either have a simple "fast-forward-able" case, or else there's nothing to merge. Given what you've posted, you must not have one of these cases, though.

[3] If there's only one merge base commit (and this is normally the case), it doesn't matter what order the two branch tip commits are listed in. For some complex commit graphs, however, there may be two or more merge base commits. Here, the picture gets rather murky. The git diff command didn't handle this very well, until quite recently; git merge handles it better, but it's still tricky.

[4] This description makes a lot of assumptions about how you're doing the merge, the shape of the graph, and so on, and is otherwise greatly simplified vs what git merge really does internally. The idea is to capture the overall goal, without getting into some of the stickier mechanics. For instance, this disregards how merge handles the case of a renamed file.

Upvotes: 12
