How does git manage to track changes if the complete folder structure is changed?

Question

I am working on a branch and recently refactored my folder structure: a lot of files were moved around, and many were renamed as well. But when I merged master (old structure) into my current branch, Git was able to work out where the files were and automatically merged the code without conflicts. How is this possible?

torek · Accepted Answer

There are two parts to this "how", which can be summarized this way:

For committing, Git doesn't care. It just makes snapshots. The file names you told Git to use, and the file contents that go with them, are all stored in your index at the time you run git commit. (This is why you have to git add files all the time: doing so copies them over the old versions in the index.) When you run git commit, Git just freezes all of those files, with whatever they have in the index right now. Those frozen files are the snapshot for the commit.

In other words, after you rearrange everything with git mv (which changes the names in the index and work-tree both), and update the contents with git add as needed, and then git commit, you get a new snapshot with the new names and any updated contents. The old snapshots remain exactly as they were: all existing snapshots are frozen forever, or at least, for as long as the commit itself lives. (The default is for them to live forever. It's possible to remove commits, but this is only easy-ish until you spread them to other repositories, after which they will keep re-infecting your repository from the other repositories even if you delete them from your own.)

For comparing (which includes merging), Git has to discover or detect renamed files. Git does this by comparing the files' content.
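To make the snapshot idea concrete, here is a minimal Python sketch. It assumes a repository using the default SHA-1 object format, where a blob's ID is the SHA-1 of a small header plus the file's bytes. The `snapshot` helper and the file names are made up for illustration; real commits store trees, not flat dictionaries.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Hash content the way Git names a blob (SHA-1 object format assumed)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def snapshot(files: dict) -> dict:
    """A toy 'commit': a mapping from path to blob ID (hypothetical helper)."""
    return {path: blob_id(data) for path, data in files.items()}

# Same contents, completely different folder structure.
old = snapshot({"README": b"hello\n",
                "src/main.c": b"int main(void) { return 0; }\n"})
new = snapshot({"docs/README.md": b"hello\n",
                "app/main.c": b"int main(void) { return 0; }\n"})

# The paths differ, but the blob IDs are the same. Nothing in either
# snapshot records that a file "was renamed".
print(sorted(old.values()) == sorted(new.values()))
```

Neither snapshot carries any rename information, which is why renames have to be detected later, at comparison time.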

The claim in the second bullet point here is actually a bit of an exaggeration, but let's see how it works by illustrating with git diff, which—at least in the mode we care about here—compares two commits. Remember that each commit represents a complete snapshot of all files. We'll find the two commits' hash IDs, and run:

git diff --find-renames <hash-of-old> <hash-of-new>

What Git will do at this point is extract each of the two commits. (The "extracting" gets short circuited as much as possible, which is usually a lot: Git can usually examine the frozen commits directly in place. But that's just a speed optimization; you can think of this as Git completely extracting both commits into a temporary work area.) Let's call the earlier commit old and the later one new here.¹ The job of git diff is to tell you what to do to change old to new. This isn't necessarily what any human who did the changing did, just some set of instructions that will produce the same result.

To find these instructions, Git will:

First, find all the files that have exactly the same name in old and new. Git assumes that if old has a file named README and new has a file named README, these must be "the same" file. These files are paired-up: they're now taken out of the equation, for the moment. Git hasn't yet figured out what to do to change the paired-up files, it's just paired them up.

(There's a step you can insert here using the -B option, in commands that have one. But we'll ignore it for now, as it just makes this complicated.)

Now, if there are unpaired files, these represent files that went missing from old and/or showed up out of the blue in new ... or do they? Maybe they are files that were renamed, that had some name O in old and some different name N in new. Here, Git computes a similarity index number for each possible pairing of files.

For speed purposes, Git can very quickly pair up any two files (one from old, one from new) that are 100% identical. That usually greatly shrinks the pool of files that have to be compared the hard way.

Finally, Git is down to unpaired files that it should consider pairing up even though they're not 100%, bit-for-bit identical. Git now does the full similarity index computation (using the same xdelta-like code that Git uses for doing delta compression in pack files) on every pair of files. (This requires really extracting all these files' data.) Whichever files get the best pairing score are paired together, if the scores exceed the minimum you chose, which defaults to "50% similar".

Files that remain unpaired after all of this extra work are either just deleted or created anew. (There are a few more complications introduced here if you add --find-copies or --find-copies-harder, but again, we'll ignore them here.)
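The pairing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for Git's own similarity computation (which works differently and will give different scores), the -B and copy-detection steps are left out, and `detect_renames` and all file names are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; Git's real computation is different."""
    return SequenceMatcher(None, a, b).ratio()

def detect_renames(old: dict, new: dict, threshold: float = 0.5):
    """Return (renames, deleted, added) for two {path: content} snapshots."""
    # Step 1: the same name on both sides means "same file"; set those aside.
    old_only = {p: c for p, c in old.items() if p not in new}
    new_only = {p: c for p, c in new.items() if p not in old}
    renames = {}

    # Step 2: cheap pass, pair files whose content is identical.
    for o_path, o_content in list(old_only.items()):
        for n_path, n_content in list(new_only.items()):
            if o_content == n_content:
                renames[o_path] = n_path
                del old_only[o_path], new_only[n_path]
                break

    # Step 3: expensive pass, score every remaining pair, best scores first.
    scored = sorted(
        ((similarity(oc, nc), op, np_)
         for op, oc in old_only.items()
         for np_, nc in new_only.items()),
        reverse=True,
    )
    for score, o_path, n_path in scored:
        if score < threshold:
            break
        if o_path in old_only and n_path in new_only:
            renames[o_path] = n_path
            del old_only[o_path], new_only[n_path]

    # Whatever is left was deleted from old or created in new.
    return renames, sorted(old_only), sorted(new_only)

old = {
    "README": "Project docs\nline two\n",
    "src/util.py": "def add(a, b):\n    return a + b\n",
    "src/app.py": "import util\nprint(util.add(1, 2))\nprint('done')\n",
    "notes.txt": "scratch notes\n",
}
new = {
    "README": "Project docs\nline two\nline three\n",
    "lib/util.py": "def add(a, b):\n    return a + b\n",
    "bin/app.py": "import util\nprint(util.add(1, 2))\nprint('finished')\n",
    "LICENSE": "Permission is hereby granted, free of charge, to any person.\n",
}

renames, deleted, added = detect_renames(old, new)
print(renames)   # src/util.py and src/app.py are paired with their new paths
print(deleted)   # notes.txt has no good match, so it counts as deleted
print(added)     # LICENSE has no good match, so it counts as new
```

README never enters rename detection at all: it has the same name on both sides, so it is paired in the first step even though its content changed.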

Now that the files have been paired-up—i.e., now that git diff knows that, say, file README.md in old mostly matches file README.rst in new, so that these two files must really be one single file, with one identity, rather than two different files with two identities—now Git compares each paired-up file to produce instructions:

-: delete this line from the version of the file found in old
+: add this line to the version of the file found in old

If you follow all the instructions, including any "rename this file" instruction given at the top, that will change the file found in old to the one found in new.
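As a rough illustration of these per-line instructions, Python's standard `difflib.unified_diff` produces output in the same `-` / `+` style. This is a sketch only: the two texts are invented, the two file labels merely mimic a renamed file, and `difflib` does not emit the rename header that Git prints at the top.

```python
from difflib import unified_diff

old_text = "Title\nfirst line\nsecond line\n".splitlines(keepends=True)
new_text = "Title\nfirst line, edited\nsecond line\n".splitlines(keepends=True)

# Label the two sides with different names to mimic a paired-up, renamed file.
diff = list(unified_diff(old_text, new_text,
                         fromfile="a/README.md", tofile="b/README.rst"))
print("".join(diff), end="")
```

Lines starting with a space are context: they appear in both versions and need no change.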

¹You can, if you like, reverse the hash IDs. Then Git will tell you how to turn the newer commit into the older commit.

How `git merge` uses `git diff`

When merging two commits, Git uses the commit graph—the Directed Acyclic Graph formed by looking at each commit's parent hash or hashes to connect all the individual commits into one big DAG—to find the best common ancestor commit. This commit is the merge base of the two specified commits.
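Finding a merge base can be sketched as a small graph search. The code below is a naive illustration, not Git's actual algorithm: the graph is a hypothetical dictionary mapping each commit to its parents, and single letters stand in for hash IDs.

```python
def ancestors(graph: dict, commit: str) -> set:
    """All commits reachable from `commit` via parent links, including itself."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(graph[c])
    return seen

def merge_bases(graph: dict, a: str, b: str) -> set:
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(graph, a) & ancestors(graph, b)
    return {c for c in common
            if not any(c != d and c in ancestors(graph, d) for d in common)}

# Hypothetical history: A - B - C - D on master, with B - E - F on a feature branch.
graph = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["B"], "F": ["E"]}
print(merge_bases(graph, "D", "F"))
```

Both A and B are common ancestors of D and F, but A is also an ancestor of B, so B is the best one.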

The git merge command then, in effect, runs two git diff commands. Both have --find-renames enabled, with the similarity threshold defaulting to 50%. You can use -X find-renames=<n> to alter this threshold, to allow more or fewer pairings of files whose names did not match.

The two diffs are:

git diff --find-renames <hash-of-base> <hash-of-ours>

and:

git diff --find-renames <hash-of-base> <hash-of-theirs>

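Both diffs do the similarity computations as needed on any unpaired file names during the internal git diff. Because each diff pairs a renamed file with its original in the merge base, Git can take the changes one side made to a file and apply them to that file under the name the other side gave it.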
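Here is a heavily simplified Python sketch of how two rename-aware diffs against the merge base let a merge follow a moved file. It works at whole-file granularity, ignores added files and most conflict cases, and uses `difflib.SequenceMatcher` as a stand-in for Git's similarity score. The `renames` and `merge` helpers and the file names are hypothetical.

```python
from difflib import SequenceMatcher

def renames(base: dict, tip: dict, threshold: float = 0.5) -> dict:
    """Map each base path to its path in tip (same name, or best match)."""
    mapping = {p: p for p in base if p in tip}
    gone = [p for p in base if p not in tip]
    fresh = [p for p in tip if p not in base]
    for old_path in gone:
        scored = [(SequenceMatcher(None, base[old_path], tip[n]).ratio(), n)
                  for n in fresh]
        if scored and max(scored)[0] >= threshold:
            best = max(scored)[1]
            mapping[old_path] = best
            fresh.remove(best)
    return mapping

def merge(base: dict, ours: dict, theirs: dict) -> dict:
    """File-level three-way merge: follow renames, keep the changed content."""
    to_ours, to_theirs = renames(base, ours), renames(base, theirs)
    result = {}
    for path, content in base.items():
        if path not in to_ours or path not in to_theirs:
            continue  # deleted on one side; a real merge has more cases here
        o_path, t_path = to_ours[path], to_theirs[path]
        o_content, t_content = ours[o_path], theirs[t_path]
        # Prefer the name that differs from base; likewise for the content.
        final_path = o_path if o_path != path else t_path
        if o_content == content:
            final_content = t_content
        elif t_content == content or o_content == t_content:
            final_content = o_content
        else:
            raise ValueError(f"both sides changed {path}: needs a line-level merge")
        result[final_path] = final_content
    return result

base = {"util.py": "def add(a, b):\n    return a + b\n"}
# Our branch only moved the file.
ours = {"lib/math_util.py": "def add(a, b):\n    return a + b\n"}
# Their branch kept the old name but added a function.
theirs = {"util.py": "def add(a, b):\n    return a + b\n\n"
                     "def sub(a, b):\n    return a - b\n"}

merged = merge(base, ours, theirs)
print(list(merged))  # the file keeps our new path, with their new content
```

This mirrors the situation in the question: one side restructured the folders, the other side kept editing files under the old names, and the edits land in the moved files.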

The complications

Adding -B tells Git to break automatically-paired files: just because a file is named README in both commits doesn't mean that this is really the same file. What if, for instance, you renamed README to old/README, then renamed new/README to README? In this case, Git will do a similarity computation on the automatically-paired files in that step I noted earlier. If the similarity is too low, Git will break the pairing. Later, if the pairing remains broken and the similarity is not extremely low, Git will rejoin the two files. This is why -B takes two numbers, not just one.

The merge command does not allow you to supply a -B argument. (Arguably, it should.)

If you use --find-copies or --find-copies-harder, Git will look at some or all of the source ("old") files to see if a newly-created destination ("new") file was copied from one of them. These use the same similarity index. This step happens after rename detection. Because it is computationally expensive, plain --find-copies only considers files that were modified in the same diff as possible sources; --find-copies-harder also inspects unmodified files.

The merge command does not allow you to specify the find-copies options either.
