How do I really apply a patch created with Git diff?

Question

I've been reading a lot of related/similar questions on this site but none of them would work, and I don't seem to be witnessing the same kind of errors so I decided to open yet a new question on this.

I am trying to learn some more of git, specifically, how to apply patches and extract commits from some branches and apply them to other branches. I initially wanted to do a dummy test, which consisted of picking some commits from a branch (up to some point in the past) and reapplying those commits to that same point in the past, to bring me back to the initial point.

However, I am getting a ton of error messages of the kind "error: patch does not apply".

I can't understand why it isn't working. I tried adding options such as --whitespace=fix (suggested in other questions on this website), to no avail. I also tried -3, in hopes that I could manually merge the files, but that just changes the error messages to "Error: Patch failed: filename" for virtually all files again.

To reproduce this error, I am using the following git repository: https://git.evlproject.org/linux-evl.git

Specifically, the branch with the commits is evl/v5.4, and the branch without the commits is master. I tried then:

git diff evl/v5.4 master > ../patchfile
git checkout master
git apply ../patchfile

torek · Accepted Answer

It would be a bit of a surprise if such a patch did apply:

git diff evl/v5.4 master > ../patchfile

Remember that git diff compares two commits, or more precisely, the snapshots in the two commits. I like to call the two commits L and R, for "left" and "right", though there's no common agreed-upon naming convention here.

For the L (left-side) commit, you chose the commit that evl/v5.4 selects. For the R (right-side) commit, you chose the commit that master selects. That's no problem so far.

Now, remember that the output from git diff is a series of instructions. These instructions, if applied, will change the set of files that appear in commit L to produce the set of files that appear in commit R. In other words, the output of this git diff gives instructions that will change evl/v5.4 into master. This will, in general, include instructions of the form "add the following three lines after line 45 of path/to/file.ext, which appear in this context" or "delete one of the following lines of some/file, which appear in the following context".

The context is what's in L, and the instructions (if and when applied) produce what's in R.

git checkout master

This obtains commit R. You don't have commit L out. The instructions for changing L into R are kind of pointless here.

You could reverse-apply the patch. After all, instructions that will turn L into R can be "followed backwards", as it were, to turn R into L. Well, that is, as long as none of the instructions is simply "delete file F", since following that backwards requires creating a new file F. If the instructions say "delete file F whose contents are ...", we can use that to create new file F.
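The direction issue can be reproduced on a toy scale. The following is only a sketch under assumptions: it builds a throwaway repository in a temporary directory, and the tag names L and R, the file name file.txt, and the shell variable names are made up for illustration. It uses git apply --check, which tests whether a patch would apply without touching any files, and git apply -R, which follows the instructions backwards.

```shell
# Throwaway repository with two snapshots, tagged L and R (names are illustrative).
repo=$(mktemp -d) && cd "$repo" || exit 1
patch=$(mktemp)
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

printf 'one\ntwo\nthree\n' > file.txt
git add file.txt && git commit -q -m "left snapshot"
git tag L

printf 'one\n2\nthree\nfour\n' > file.txt
git commit -q -a -m "right snapshot"
git tag R

# Instructions for turning L into R.
git diff L R > "$patch"

# With R checked out, the L-to-R instructions find none of the context they expect...
git checkout -q R
if git apply --check "$patch" 2>/dev/null; then forward=applies; else forward=rejected; fi

# ...but followed backwards (-R), they would turn R into L.
if git apply -R --check "$patch" 2>/dev/null; then reverse=applies; else reverse=rejected; fi

# With L checked out, the forward direction is fine.
git checkout -q L
if git apply --check "$patch" 2>/dev/null; then on_left=applies; else on_left=rejected; fi

echo "on R, forward: $forward; on R, reversed: $reverse; on L, forward: $on_left"
```

Dropping --check from whichever variant succeeds would actually modify the work-tree files.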

On a variant of this topic...

how to ... extract commits from some branches and apply [them] to other branches

A commit is a snapshot, not a set of changes. But it's not just a snapshot: it's a snapshot plus some information about the snapshot. This metadata, or extra information about the data—the snapshot being the data—includes the name and email address of the person who made the commit. It includes some date-and-time-stamps. It includes a log message, which is pretty much arbitrary and up to the person who made the commit. But importantly for Git, it also includes the raw hash IDs of some set of earlier commits.

Git finds each commit by its hash ID. The hash ID is, in essence, the "true name" of a commit. The hash ID of a commit can never change, and the contents of the commit itself can never change either. (Git ensures both of these by the way it stores each of its internal objects in a key-value database, where the keys are hash IDs, and the hash IDs are cryptographic checksums over the contents as stored under that key.)
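A quick way to see the "keys are checksums of the contents" idea, sketched in a throwaway repository (the directory and variable names are assumptions for illustration): git hash-object computes the ID for some bytes, -w also stores them, and git cat-file -p looks the object up again by that ID.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q

# Same bytes give the same ID, every time; -w also stores the object.
id1=$(printf 'hello\n' | git hash-object -w --stdin)
id2=$(printf 'hello\n' | git hash-object --stdin)

# Different bytes give a different ID.
id3=$(printf 'hello!\n' | git hash-object --stdin)

# The ID is the key: use it to get the stored contents back.
stored=$(git cat-file -p "$id1")

echo "$id1"
echo "$id2"
echo "$id3"
echo "$stored"
```

Commits are stored the same way, which is why changing anything about a commit necessarily produces a different hash ID, that is, a different commit.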

A branch name simply holds the hash ID of the last commit in some chain of commits. The chain can be pretty simple and linear, and many are. If we use uppercase letters to stand in for the hash IDs, we get a drawing like this one:

... <-F <-G <-H

where the last commit is the right-most one, i.e., commit H. This commit contains data (a full snapshot of every file) and metadata: who made it, when, and why, and the hash ID of earlier commit G.

We pick a branch name that we'd like to use to find H, and have Git store the actual hash ID of commit H in that name:

...--F--G--H   <-- master

I've stopped drawing the backwards-pointing arrows between commits as arrows, but they really are a sort of arrow coming out of each commit. It's just that, with the commit contents frozen for all time, H will forever point to G, and since we know that the commit hash IDs look random, G can't know what its future child H's hash ID will be, so the connections must go backwards.

Given the name master, then, we have Git find commit H by its hash ID (stored in the name master). Given commit H, we can have Git find G's hash ID: that's part of the metadata in H. Given G's hash ID, we can have Git find commit G. So, once we've found the last commit, we can work back one hop, to the second-to-last commit.

That commit, of course, has a hash ID embedded inside it as well. From G, we can hop back to F. We can keep this up as long as the arrows keep going, all the way back to the very first commit ever. (Being the very first commit ever, it has no backwards-pointing arrow, which is how we / Git know to stop going back.)
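The backwards walk can be done by hand. This is a sketch in a throwaway repository; the three commits stand in for F, G and H, and the variable names are made up. git cat-file -p prints the raw commit (tree, parent, author, committer, message), and the rev^ syntax means "the parent of rev".

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

# Three commits in a row, standing in for F, G and H.
for name in F G H; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "commit $name"
done

# The raw commit object: snapshot (tree), parent hash ID, who, when, and why.
git cat-file -p HEAD

# Hop backwards one parent at a time.
h=$(git rev-parse HEAD)
g=$(git rev-parse "$h^")
f=$(git rev-parse "$g^")

# The parent recorded inside H is exactly G's hash ID.
parent_in_h=$(git cat-file -p "$h" | sed -n 's/^parent //p')

# The very first commit has no parent, so the walk stops there.
if git rev-parse -q --verify "$f^" >/dev/null; then is_root=no; else is_root=yes; fi

# git log and git rev-list do the same walk automatically.
count=$(git rev-list --count HEAD)
echo "commits reachable from HEAD: $count"
```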

What this means is that the commits in a repository are the history in the repository. History is nothing but commits. The commits all connect, backwards. A repository is just a collection of commits, and names—branch names, or any other names—just give us a way to get into the commits.

To add a new commit to this repository, we check out the existing commit H:

...--G--H   <-- master (HEAD)

which makes master the current branch and commit H the current commit, all of which we can find by using the special name HEAD, which is now attached to the name master.

Then, we make some changes to some files that aren't actually in Git. (The files that are in Git can't be changed.) We have Git copy those files into a new commit, add on some metadata (including a name and email address, and "now" as the author and committer time stamps, for instance), and hash this all up and get a new, unique hash ID. (The timestamp stuff helps ensure that this commit gets a totally new hash ID, even if everything else is the same, though normally the data in a new commit isn't the same as the data in the previous commit, and, moreover, the parent hash IDs won't match. But the time won't match either.) The parent for our new commit will be commit H. Git can now write out all the data and metadata and thus make the new commit. We'll call its big ugly random-looking hash ID I, and draw it in, pointing back to H:

...--F--G--H
            \
             I

Now comes the sneaky trick: Git simply writes I's hash ID into the name master, to which the special name HEAD is attached. So we don't need to draw I on a line of its own after all:

...--F--G--H--I   <-- master

Nothing in any of the existing commits changed. New commit I is the last one, and it points back to H. The branch name changed, or rather, the hash ID stored in the branch name changed. The name points to the last commit—by definition, actually. If we force Git to point the name back to commit H, commit I simply vanishes from view: it's still there, but we can't find it any more, unless we saved its hash ID somewhere.
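Here is a small sketch of that, again in a throwaway repository with made-up variable names. It assumes the branch is called master, as in the drawings, and uses git reset --hard as one way of forcing the name back to H.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is named master
git config user.name "Demo" && git config user.email "demo@example.com"

echo H > file.txt && git add file.txt && git commit -q -m H
h=$(git rev-parse master)

# Making commit I just writes I's hash ID into the name master.
echo I > file.txt && git commit -q -a -m I
i=$(git rev-parse master)
parent_of_i=$(git rev-parse "$i^")        # I points back to H

# Force the name back to H: I vanishes from view...
git reset -q --hard "$h"
visible=$(git rev-list --count master)

# ...but it is still in the repository, if we saved its hash ID.
type_of_i=$(git cat-file -t "$i")
echo "visible from master: $visible; object $i is a $type_of_i"
```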

Now, no matter what else is going on, we have one of these graph things, with branch names pointing to the last commit in each chain. So if we have, say:

          I--J   <-- branch1
         /
...--G--H   <-- master
         \
          K--L   <-- branch2

then the last commit on branch2 is L, the last commit on branch1 is J, and the last commit on master is H. Commit H is actually on all three branches, because in Git, the notion of being "on a branch" just means that we can start at the end—the way Git does, backwards—and work backwards to reach the given commit. From L, we can hop to K, then to H, so commit H is on branch2. Or, using the name master, we start at H, so commit H is on master.
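The "on a branch means reachable from the branch's last commit" idea can be checked directly. The sketch below rebuilds the drawing in a throwaway repository. The commit and on_branch shell functions are hypothetical helpers written for this example, and each commit is tagged with its letter so it can be named without a hash ID. git merge-base --is-ancestor A B succeeds when A can be reached by walking backwards from B.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: make one commit per letter and tag it with that letter.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1" && git tag "$1"; }

commit G && commit H
git checkout -q -b branch1 && commit I && commit J
git checkout -q -b branch2 master && commit K && commit L

# Hypothetical helper: can we reach commit $1 by walking back from name $2?
on_branch() { if git merge-base --is-ancestor "$1" "$2"; then echo yes; else echo no; fi; }

h_on_master=$(on_branch H master)
h_on_branch1=$(on_branch H branch1)
h_on_branch2=$(on_branch H branch2)
k_on_branch1=$(on_branch K branch1)

# The same question, asked the other way around.
git branch --contains H
git branch --contains K
```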

Meanwhile, if we take any parent/child pair (say, K-L, as it appears on branch2) we can have Git compare these snapshots. For all the files that are the same, Git says nothing at all: the instructions to change K into L for such a file are "do nothing at all". For each file that is different, Git shows some instruction(s); these tell us how to change the file as it appears in K, to make it as it appears in L.

If we like, we can git checkout branch1:

          I--J   <-- branch1 (HEAD)
         /
...--G--H   <-- master
         \
          K--L   <-- branch2

and now we have, as regular files that we can work on, every file that's in J. Git basically copied all the files out of commit J, into a work area.

To the extent that the instructions for changing K to L apply, we can have Git apply those instructions. We can do this by finding the two hash IDs for commits K and L, and running:

git diff <hash-of-K> <hash-of-L>

to get these instructions. Then we can try to use those instructions on the files we have checked out right now. They might not all work because maybe some files are gone, or some file where we are supposed to change line 42 doesn't have that line any more. But we can try to apply these changes.

To do this automatically in Git, we don't have to use git diff and git apply. Instead, we can use git cherry-pick. This is actually considerably fancier, because cherry-pick uses Git's internal merge mechanism to combine changes. But, for now, you can think of cherry-pick as "compare parent and child, find the differences, and apply the differences to whatever commit we have out now".

Because Git has the graph, and commit L connects (backwards) to commit K, we only need to tell Git to cherry-pick the hash ID of commit L:

git cherry-pick <hash-of-L>

There are some easier, shorter ways of specifying particular commits, that don't involve typing in the entire hash ID. Of course, nobody sane tries to type in a whole hash ID in the first place: we use cut-and-paste to copy the hash IDs. It's way too easy to typo something (though, fortunately, hash IDs are sparse enough that this just results in Git saying whaddaya talkin' 'bout?!). But I won't go into that here; this is plenty for now.
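To make the comparison concrete, here is a sketch in a throwaway repository: it checks the K-to-L instructions by hand with git diff and git apply, then lets git cherry-pick do the work. The commit function is a hypothetical helper, the commits are tagged with their letters so no hash IDs need typing, and the graph is a cut-down version of the drawing (only J on branch1).

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: make one commit per letter and tag it with that letter.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1" && git tag "$1"; }

commit H
git checkout -q -b branch1 && commit J
git checkout -q -b branch2 master && commit K && commit L

# We want the K-to-L change on top of J.
git checkout -q branch1

# By hand: get the instructions and check that they would apply here.
if git diff K L | git apply --check; then by_hand=applies; else by_hand=rejected; fi

# Automatically: cherry-pick finds L's parent (K) through the graph.
git cherry-pick L >/dev/null

# The new commit sits on top of J and carries the same change as K-to-L.
picked_parent=$(git rev-parse "HEAD^")
same_change=no
[ "$(git diff K L)" = "$(git diff J HEAD)" ] && same_change=yes
echo "by hand: $by_hand; same change after cherry-pick: $same_change"
```

The new commit is a copy of L with a different hash ID, since its parent (J) and its snapshot differ from those of the original L.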

[Edit, 2 Jan 2021] After cloning the repository in the question, I can run the following. Note that the current branch is master and the work-tree has no untracked files initially. A git clean -dfx produces no output. It's important to use --index with the git apply below; I'll explain why in a moment.

$ git diff --no-renames master evl/v5.4 > ../patchfile
$ git apply --index < ../patchfile
<stdin>:18659: space before tab in indent.
        int data;
<stdin>:18660: space before tab in indent.
        /* Other data fields */
<stdin>:29742: space before tab in indent.
    apq8016
<stdin>:29743: space before tab in indent.
    apq8074
<stdin>:29744: space before tab in indent.
    apq8084
warning: squelched 352 whitespace errors
warning: 357 lines add whitespace errors.
$ git status | head
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   .clang-format
        modified:   .gitattributes
        modified:   .gitignore
        modified:   .mailmap
        modified:   COPYING
$ git checkout -b tmp && git commit -q -m apply
Switched to a new branch 'tmp'
$ git diff evl/v5.4 tmp
$

As you can see, this diff (with the order of the two commits swapped relative to the one in the question), applied with --index, suffices. Using -3 or --3way would work as well, since those options imply --index.

The reason --index is required—whether explicit or implied—is that the patch itself creates files that are listed in a .gitignore file. Specifically, the tools/perf/lib/include/perf/* files are all ignored. And yet, these files are in the commit at the tip of evl/v5.4, and hence in the diff as new files. So when Git is applying the diff, it creates these files.

If you apply the diff without --index, Git applies the diff to your work-tree (only). You must then use git add to add the updated files. But since the newly created files are listed in a .gitignore, they get ignored if you add them separately. The entire tools/perf/lib/include/perf/ directory does not exist in master so there are no such files in the index of the currently-checked-out commit. Those files are in the commit at the tip of evl/v5.4, so if you run git checkout evl/v5.4, they wind up in Git's index: a git checkout copies all the files from the selected commit to the index, even if those files are nominally ignored. But our git apply method does not copy those (new) files into the index unless we use --index, and then a subsequent git add obeys the newly created tools/perf/.gitignore file:

$ cat -n tools/perf/.gitignore
     1  PERF-CFLAGS
     2  PERF-GUI-VARS
     3  PERF-VERSION-FILE
     4  FEATURE-DUMP
     5  perf
     6  perf-read-vdso32
     7  perf-read-vdsox32
     8  perf-help
     9  perf-record
    10  perf-report
    11  perf-stat
    12  perf-top
    13  perf*.1
    14  perf*.xml
    15  perf*.html
    16  common-cmds.h
    17  perf.data
    18  perf.data.old
    19  output.svg
    20  perf-archive
    21  perf-with-kcore
    22  tags
    23  TAGS
    24  cscope*
    25  config.mak
    26  config.mak.autogen
    27  *-bison.*
    28  *-flex.*
    29  *.pyc
    30  *.pyo
    31  .config-detected
    32  util/intel-pt-decoder/inat-tables.c
    33  arch/*/include/generated/
    34  trace/beauty/generated/
    35  pmu-events/pmu-events.c
    36  pmu-events/jevents
    37  feature/
    38  fixdep
    39  libtraceevent-dynamic-list

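Line 5, the pattern perf, tells Git to ignore any file or directory named perf anywhere below tools/perf, and that includes everything in tools/perf/lib/include/perf. So git add . ignores those files, and the new commit doesn't match the tip commit of evl/v5.4.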
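The same trap can be reproduced without cloning a kernel tree. This is a sketch under assumptions: a throwaway repository, a made-up directory sub/ playing the role of tools/perf, a made-up header core.h, and made-up branch names feature, try1 and try2. It compares the tree that results from git apply followed by git add . with the tree that results from git apply --index.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
patch=$(mktemp)
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo" && git config user.email "demo@example.com"
echo base > README && git add README && git commit -q -m base

# A commit containing a file that its own .gitignore would reject.
git checkout -q -b feature
mkdir -p sub/lib/include/perf
echo perf > sub/.gitignore
echo '/* header */' > sub/lib/include/perf/core.h
git add sub/.gitignore
git add -f sub/lib/include/perf/core.h    # -f overrides the ignore rule
git commit -q -m feature
git checkout -q master

git diff --no-renames master feature > "$patch"
tree_feature=$(git rev-parse 'feature^{tree}')

# Attempt 1: apply to the work-tree only, then git add.
git checkout -q -b try1
git apply "$patch"
git add .                                 # obeys sub/.gitignore, skips core.h
git commit -q -m "apply, then add"
tree_try1=$(git rev-parse 'try1^{tree}')

# Attempt 2: start clean, and let git apply update the index too.
git checkout -q master && git clean -q -dfx
git checkout -q -b try2
git apply --index "$patch"
git commit -q -m "apply --index"
tree_try2=$(git rev-parse 'try2^{tree}')

[ "$tree_try1" = "$tree_feature" ] && echo "try1 matches" || echo "try1 differs"
[ "$tree_try2" = "$tree_feature" ] && echo "try2 matches" || echo "try2 differs"
```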

We can put it another way: you can create a commit containing files that the commit's own .gitignore would reject. For instance, if a commit's top-level directory contains a .gitignore with the line *, a plain git add won't add any of the files that are in that commit. Yet that commit will contain the files it contains, and checking it out will get you a work-tree with those files. It's just that extracting those files into an otherwise-empty repository, then using git add, won't make a commit that stores the same tree. The commit you'll get is path-dependent.

I consider such .gitignore files suspect at least, and faulty in general, although some people think it's just fine (because you can use git add -f to override the ignores, or temporarily move the .gitignore file out of the way, or whatever). This particular linux-evl commit is one such commit, and it tripped up both of us at first.
