Reputation: 75

No differences in git diff but shouldn't be there some due to changed line endings?

It might be that my question comes from misunderstanding Git in certain aspects. The question came to mind when I was changing CRLF line endings to LF on my Mac, after the files had been edited on a Windows machine.

1) I started by initializing a new repository on OSX and put all the files affected by CRLF line endings into it.

2) I made the first commit. Since core.autocrlf = input is set, git automatically changed the line endings to LF.

The files in my local working tree still had CRLF line endings but the solution was also provided here (How to normalize working tree line endings in Git?):

Delete the files in the Index and restore the index + working tree based on the last commit:

git rm --cached -r .
git reset --hard

Now the confusion begins: my first commit (step 2) contains the converted LF line endings, whereas my local tree and the index do not. Hence I expected git to show differences between the working tree/index and the repository. But

git diff HEAD
git diff --cached

did not list any changes?

Upvotes: 3

Answers (2)

torek

Reputation: 488013

I mentioned some of this in a comment, but it needs a lot of room for a real answer to cover it properly. One of the behaviors that seemed odd to you was this:

Often, committed files have LF-only line endings.
Often, work-tree files have CRLF line endings (as Windows users tend to prefer).
These can be true at the same time, and yet, git status and git diff will not mention any change in the line endings.

This behavior is necessary and appropriate. It would be wrong for you to run:

git checkout master
git diff

and see a lot of diffs!¹ But the actual implementation here is very tricky and can result in some apparent weirdness.

There are several key elements that go into understanding this—and understanding a lot of other Git behavior too. You have already mentioned some of them, but let's go a bit deeper into the details and look at how Git manages line-endings. The things we need to discuss are:

the way files are stored inside commits: what I like to call a freeze-dried format;
the way files are stored inside the thing that Git calls, variously, the index or the staging area;
the way files are stored on your computer, in your work-tree, where you can see and work on/with them; and
how they go from one storage format to another.

This last step is the key to end-of-line issues but it's tangled up with the other items.

¹Nonetheless, sometimes that very thing happens, for other reasons. I'll touch on these here too.

Commits store freeze-dried files

Every commit stores a complete copy of every file—well, of every file that's in the commit, but that's obviously tautological. The idea behind this claim is that if you have files README.md and main.py, say, and you make a new commit where you've changed main.py but not README.md, the new commit still makes another copy of README.md anyway.

Obviously, re-committing every file every time would be a big waste of disk space. Git avoids this through a number of clever tricks. The first obvious one is that each stored file is compressed (as with gzip or bzip or rar; Git actually uses zlib compression). For most files, compressing them makes them take less space. Typical source code compresses quite well. Compressing already-compressed files tends to backfire a bit—one reason not to store compressed files in Git!—but doesn't make them enough larger to be a problem here, so Git just runs zlib deflate over everything.

The more important trick here, though, is that once Git has frozen a file into a commit, that file is absolutely, totally, 100% read-only. There's a strong technical reason for this, in that Git stores everything—all of what it calls objects—in a simple key-value database, where the keys are hash IDs formed by hashing the value, and the value is the byte-string that is the file's data, prefixed with the object type and size.² Since the key itself depends on the data, you literally can't change the data: if you try, you get instead a new and different object with a new and different hash ID.³ The old object is still there in the database, with its old key and old stored bytes: the compressed and frozen, i.e., freeze-dried, file is still there.

What this means is that Git never has to store the same file again after all. It can just re-use the file from the previous commit! That is, if we just made a new commit with a new and different main.py, well, Git had to write the new different main.py to a new freeze-dried object, but we made it with the same old README.md, so Git can just re-use the previous freeze-dried README.md.⁴

Git's term for these freeze-dried files is blob object. Blobs and commits are two of Git's four types of object. For completeness, the remaining two are tree and annotated tag, but we don't need to worry about those here. We only need to look at the blob objects, and because commits are what retain the blobs (indirectly—through tree objects!), commit objects (lightly).
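As a rough illustration of the key-value scheme described above, the following sketch hashes the same bytes twice: once with git hash-object, and once by hand as the SHA-1 of "blob <size>", a NUL byte, and the data. It assumes the default SHA-1 object format and that either sha1sum or shasum is installed; the helper name sha1_hex is made up for this example.

```shell
# Hypothetical helper: print the SHA-1 of stdin as hex, using whichever tool exists.
sha1_hex() {
  if command -v sha1sum >/dev/null 2>&1; then sha1sum; else shasum -a 1; fi | cut -d' ' -f1
}

content='hello world'

# Ask Git for the blob ID of these bytes (no repository is needed when not writing).
git_id=$(printf '%s\n' "$content" | git hash-object --stdin)

# Build the same ID by hand: header "blob <size>", a NUL byte, then the data.
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')
manual_id=$( { printf 'blob %s\000' "$size"; printf '%s\n' "$content"; } | sha1_hex )

echo "git:    $git_id"
echo "manual: $manual_id"
```

If the two printed IDs match, the object ID really is just a function of the type, the size, and the bytes, which is why identical file contents are stored only once.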

²The prefix ensures that, e.g., commit <size>\000<commit data> has a different hash from blob <size>\000<copy of the commit's data>. Git wants to be able to extract the type from the object, so the fact that you can read out an existing commit and create a file with those contents and store it as a file, means that the type-prefix is necessary.

The hash function is a cryptographic one, in part so that you can't deliberately fiddle with it to create a collision, but mostly just to get really good hash distributions. Forced hash collisions are theoretically possible and could be a problem for Git in the future, so Git is moving to a longer and more-secure hash. See How does the newly found SHA-1 collision affect Git?

³Git checks that the hash ID it used to find the object matches the hash of the data, when extracting the data from the object. This acts as a data-corruption test: if the hash of the data, as retrieved by the key, does not match the original key, Git knows that the on-disk data are invalid and tells you that.

⁴Later, Git compresses these key-value-store objects even further, by taking objects that have been sitting around for a while and packing them into what Git calls a pack file. The objects in a pack file are delta-compressed against other objects in that pack file. To do the delta encoding, Git undoes the zlib deflation, finds overlapping byte sequences—there tend to be a lot of these in source code—and builds a delta encoded version that says take the old copy of the file and make these changes to it: a binary, byte-coded variant of what you see as a git diff. These deltified pack objects then all go into a single pack file. There's a huge amount of effort that goes into deciding what gets deltified against what: it's not just "new version of file vs old version of file".

Higher level Git software just says get me the object with hash ID H. If the object exists as an unpacked object, Git gets that while re-zlib-inflating it. Otherwise Git looks at each pack file. If the object is there, Git can re-assemble it from its deltified pieces, all from that one pack file. The code one level up never has to know whether the file was a single object, or pieces stored in a pack. Hence, it's accurate to say that at the object level, Git only does zlib compression, without delta compression. Delta encoding, if it happens at all, happens below the object level.

Freeze-dried files get rehydrated into the work-tree

This part is pretty straightforward: there's just one wrinkle, which we'll leave for the next section. A commit is a snapshot of every file, but they're all in this Git-only freeze-dried form. They're totally frozen, which is fine for archival, but until converted back, can't even be used; and as long as they're frozen, they're no good for getting any new work done. So they have to be rehydrated, as it were: turned back into ordinary files, stored in ordinary directories / folders, in whatever way your particular OS requires. The result of rehydrating the freeze-dried committed files is the work-tree.

The index / staging-area

Here's where the wrinkle I mentioned comes in. Rather than directly extracting files to your work-tree, Git first extracts the commit into what Git calls the index (in some places) or staging area (in other documentation). What the index is and does gets more complicated during merge operations, but for the most part, it's simple to describe: it's the proposed next commit.

When Git goes to make a new commit, Git does not use what's in the work-tree. There are version control systems that are similar to Git that do use the work-tree as the proposed next commit, and they tend to be a lot easier to use, but also a lot slower. When using these, you tell the system make a new commit, and it essentially goes and freezes every file again, into a new commit.

Git, on the other hand, says: Hey, wait! We already freeze-dried most of your files. Instead of re-freeze-drying every file on a new commit, let's force you, the user, to do it on the specific files you changed, by making you run git add on them! So Git starts by extracting every file to the index, before rehydrating it into the work-tree. The git add command freeze-dries a file from the work-tree and copies it into the index, replacing the one that was already there from the earlier commit, or, if it's a new file, creating a new file in the index that wasn't there before. Either way, now that file is ready to go into the next commit ... and so are all the files you didn't git add. They're still there from the git checkout, ready to go into the new commit.

This is where all that craziness about tracked vs untracked files comes from. A tracked file is simply any file that is in the index right now. An untracked file is any file that is in the work-tree right now, but not in the index right now. At any time, you can put one file into the index right now: git add file. At any time, you can take one file out of the index right now: git rm file or git rm --cached file. Using git rm takes the file out of both the index and the work-tree, while using git rm --cached takes the file out of the index only, leaving the work-tree file alone.
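A small sketch of the tracked/untracked distinction, run in a throwaway repository (the temporary directory and the name file.txt are just placeholders for this example):

```shell
# Work in a scratch repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .

echo 'hello' > file.txt

# git add copies the file into the index: it is now tracked.
git add file.txt
tracked_before=$(git ls-files)

# git rm --cached removes it from the index only; the work-tree file stays.
git rm -q --cached file.txt
tracked_after=$(git ls-files)

echo "tracked before: '$tracked_before'"
echo "tracked after:  '$tracked_after'"
git status --porcelain   # lists file.txt as untracked (??)
```

The file never leaves the work-tree here; only its index entry comes and goes, and that alone decides whether Git calls it tracked.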

Of course, other things you do also modify the index. The most obvious one is that git checkout often has to replace the index, or at least parts of it. These details can get very tricky—see Checkout another branch when there are uncommitted changes on the current branch—but it really does all boil down to putting files into the index, or taking them out, along with putting files into the work-tree, or taking them out, or (e.g., git rm --cached or git reset --mixed) leaving the work-tree alone while changing stuff in the index.

Regardless of how the index changes—or doesn't change—the main thing to keep in mind is this: At all times, there are up to three active copies of each of your files:

One copy is the freeze-dried one in the current (HEAD) commit. You can view this with git show HEAD:file. You cannot change this file, at all, ever—all you can do is change the commit that the name HEAD calls up, by creating new commits, or using git checkout to move to a different commit.
One copy is the freeze-dried one in your index. You can view this with git show :file or git show :0:file.⁵ You can replace it with a new one from your work-tree using git add.
The last copy is the normal everyday read/write one in your work-tree. You can use any of your regular non-Git commands on this.
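To make the three copies concrete, here is a minimal sketch in a scratch repository where each copy deliberately holds different content. The file name and the user.name / user.email values are placeholders chosen for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Copy 1: commit "version 1", so it is frozen into HEAD.
echo 'version 1' > file.txt
git add file.txt
git -c user.name='Example' -c user.email='example@example.com' commit -q -m 'first'

# Copy 2: stage "version 2", so the index now differs from HEAD.
echo 'version 2' > file.txt
git add file.txt

# Copy 3: edit again without adding, so the work-tree differs from the index.
echo 'version 3' > file.txt

head_copy=$(git show HEAD:file.txt)
index_copy=$(git show :file.txt)
worktree_copy=$(cat file.txt)

echo "HEAD:      $head_copy"
echo "index:     $index_copy"
echo "work-tree: $worktree_copy"
```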

I say up to three here because, e.g., of course an untracked file isn't in your index (whether or not it's in the HEAD commit), or a totally new file that's never been committed yet might be in both the index and work-tree but not in HEAD. It should, in general, be obvious how many copies there are, in each situation.

Note that the index actually just holds the blob hash ID of the freeze-dried file, which is already saved in Git's object store. If you commit the file, the blob hash becomes permanent, as the commit itself now uses it. Otherwise the object can eventually expire (though not while its hash remains in the index).⁶

⁵The number zero here is the staging number, which has to do with merges. The default number is zero, and except during merge conflicts, everything is always just in staging slot zero—so you can use :0: or just : to mean in the index.

⁶There was a very nasty bug in git worktree add for a while. The garbage collector did not account for the extra index file, nor the per-worktree refs, associated with each work-tree. It never scanned these extra index files and refs, and if any particular hash appeared in only such an index or ref, Git would sometimes expire such objects, even though the added work-tree needed them! This was fixed in Git 2.15.

Line-endings, and smudge and clean filters

Now that you're used to the idea that Git stores, at all times, up to three copies of each file, now we can see how the end-of-line manipulations in Git work. Moreover, we can see how you can define smudge and clean filters, and how they work.

The process of taking a file from the freeze-dried form in the HEAD commit and putting it into the index is really simple: Git just determines the relative path of the file, such as README.md or dir1/dir2/file.py, and makes room in the index at the appropriate place—the index is carefully arranged for fast access—and stuffs the key information about the freeze-dried copy there. Git also stuffs a bit of information about the work-tree copy into the index entry for that file, as we'll see in a moment.

Since the index just holds the hash ID of a freeze-dried file, what's in the index is exactly what will be in the next commit, if you make it right now. If what's in the index came out of the HEAD commit, it's exactly what is in the HEAD commit.

As with all frozen, hash-ID-keyed objects, nothing here can change. You can make a new and different object with a new and different hash ID, and since you can write new hash IDs into the index, you can replace the index copy wholesale, but since you can't stuff new hash IDs into an existing commit, you can't change the commit. If you do change the index, you change it to exactly what you propose to put into the next commit.

Meanwhile, what goes into the work-tree is a rehydrated copy of the file. The committed and index copies are freeze-dried: they're in a Git-only format. The work-tree copies are ordinary. There's a transformation that absolutely must take place, during the take out of Git, put into work-tree process, every time. There's a corresponding transformation that absolutely must take place, during the git add freeze-dry the file and stuff it into the object store and index process, every time.

So: why not, during that transformation process, also do any end-of-line filtering? And that's exactly what Git does:

Copying a file from the index to the work-tree (git checkout, mostly): if the work-tree file should have CRLF line endings, Git can turn LF-only line endings in the blob, into CRLF line endings in the work-tree. In fact, it can insert any arbitrary "dirty" stuff you might like to have, through your smudge filter. We can, in general, refer to this as smudging files.
Copying a file from the work-tree to the index (git add, mostly): if the committed file should have LF-only line endings, Git can turn any CRLF-endings into LF-only endings while writing the blob object. In fact, it can "clean out" any "dirt" you added in your smudge filter, through your clean filter. We can refer to this as cleaning files.

Git provides three built in line-ending smudging and cleaning modes here. If you want others, you have to write your own smudge and clean filters:

Do nothing: keep index and work-tree matching. This is appropriate for all binary data. It's also appropriate, in general, on Linux systems, where lines shouldn't have CRLF endings in the first place, so if everything in the repository always matches everything in the work-tree and nothing ever has CRLF endings, there's never any problem.
Do LF-to-CRLF on write-to-work-tree, and CRLF-to-LF on write-to-index. This is appropriate for some text files for Windows users.
Do nothing on write-to-work-tree, but do CRLF-to-LF on write-to-index. This is the mode Git calls input. It's not especially appropriate for anything, in my opinion. This may be why input is mostly a backwards-compatibility feature. You can set the same mode with eol=lf in a .gitattributes file, though.
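The second mode (LF in the index, CRLF in the work-tree) can be observed with a sketch like the one below. It is only an illustration: it uses a .gitattributes rule with eol=crlf instead of any core.autocrlf setting, and the file names are invented for the example. Carriage returns are counted with tr and wc.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Ask for CRLF in the work-tree and LF in the repository for *.txt files.
echo '*.txt text eol=crlf' > .gitattributes

# Start with an LF-only file and stage it (Git may print a warning on stderr).
printf 'one\ntwo\n' > notes.txt
git add .gitattributes notes.txt 2>/dev/null

# Re-extract the file from the index: this is the "smudge" direction.
rm notes.txt
git checkout -- notes.txt

# Count carriage returns in each copy.
worktree_cr=$(tr -cd '\r' < notes.txt | wc -c | tr -d ' ')
index_cr=$(git cat-file blob :notes.txt | tr -cd '\r' | wc -c | tr -d ' ')

echo "CRs in work-tree copy: $worktree_cr"
echo "CRs in index copy:     $index_cr"
```

The index copy stays LF-only while the re-extracted work-tree copy gains one CR per line.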

`git diff` and `git status` vs smudge/clean/etc

What git diff does—or is intended to do—is mainly:

compare an entire commit to another entire commit; or
compare any commit to the proposed next commit (i.e., the index); or
compare any commit to the work-tree; or
compare the proposed next commit (the index) to the work-tree.
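The four comparisons listed above correspond roughly to the four invocations below. This is a sketch in a scratch repository; the helper commit_as_example and the file name f are made up for the illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical helper so the example does not depend on a global identity.
commit_as_example() {
  git -c user.name='Example' -c user.email='example@example.com' commit -q -m "$1"
}

echo 'a' > f;    git add f; commit_as_example 'first'
echo 'bb' > f;   git add f; commit_as_example 'second'
echo 'ccc' > f;  git add f      # index now differs from HEAD
echo 'dddd' > f                 # work-tree now differs from the index

commit_vs_commit=$(git diff --name-only HEAD~1 HEAD)    # commit vs commit
commit_vs_index=$(git diff --cached --name-only HEAD)   # commit vs index
commit_vs_worktree=$(git diff --name-only HEAD)         # commit vs work-tree
index_vs_worktree=$(git diff --name-only)               # index vs work-tree

printf '%s\n' "$commit_vs_commit" "$commit_vs_index" "$commit_vs_worktree" "$index_vs_worktree"
```

Only the first two work purely on blobs; the last two have to look at the work-tree, which is where the line-ending question arises.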

Several of these operations work exclusively with blobs—freeze-dried files in commits or in the index. This is easy, comparatively speaking: they're already in whatever form they will always be in. There's no end-of-line fiddling, or smudging or cleaning, to do. But anything that compares a commit or the index to the work-tree has a problem, if an end-of-line or smudge filter has changed what's in the work-tree.

There are two obvious ways to deal with this problem. Git could:

clean the work-tree files (by adding them somewhere, e.g., to a temporary index), then compare the cleaned files; or
re-smudge the index or commit copies (by extracting them somewhere, e.g., to temporary files), then compare the smudged files.

Both of these are slow: they mean re-copying every file that uses these features, every time you compare something to the work-tree. Git will do this when necessary (and based on the source, it can do either one—I'm not sure off hand which one happens when). But Git tries to be more clever than that.

If you've just now checked out a file—just copied it from the index to the work-tree—the work-tree copy must, by definition, match the index copy, regardless of how "smudgey" the work-tree copy is. Similarly, if you've just now git added a file—just copied it from the work-tree to the index—the index copy must, by definition, match the work-tree copy, regardless of how "clean" the index copy is. Git saves, in the index, a bunch of OS-level information about the work-tree copy of a file, as compared to the index copy of the file. If these two match, Git gets to assume that the index and work-tree copies match.

Note that Git retains this assumption in key cases, even if it shouldn't. In particular, suppose you have a committed file that has LF-only line endings, and you configured your repository with .gitattributes and/or other settings that told Git: When copying this file either way, do LF / CRLF translation as appropriate for the direction of copy. Since then, you changed the .gitattributes or other settings so that if Git were to re-extract the file now, it would do nothing, and if you git add the file now, it will do nothing—which would add a version of the file with CRLF line endings, to the index.

Git will insist that the index and work-tree copies of the file match, even though they no longer do. If you change the settings back to a mode where Git will do the translation, now the files match again. At all times, Git keeps insisting that the files match—because it's using the index's file-status information to bypass doing the hard work to really check.

The git status command consists, in part, of running two git diff commands, one to compare HEAD to index, and one to compare index to work-tree. The first diff has no line-ending issues, so there is nothing to worry about here, but the second has the usual index-vs-work-tree issues. It actually uses the same code as git diff, so it behaves the same way in terms of thinking things are clean or not.
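The situation from the question can be reproduced with a sketch along these lines: with core.autocrlf set to input, the committed blob is LF-only, the work-tree file still has CRLF endings, and yet git diff and git status report nothing. The file name and identity values are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf input

# A work-tree file with CRLF endings; git add cleans it to LF in the index.
printf 'one\r\ntwo\r\n' > crlf.txt
git add crlf.txt 2>/dev/null
git -c user.name='Example' -c user.email='example@example.com' commit -q -m 'first'

# The bytes really do differ between the two copies...
worktree_cr=$(tr -cd '\r' < crlf.txt | wc -c | tr -d ' ')
head_cr=$(git cat-file blob HEAD:crlf.txt | tr -cd '\r' | wc -c | tr -d ' ')

# ...but Git treats the work-tree as matching the commit and the index.
if git diff --quiet HEAD; then diff_result=clean; else diff_result=dirty; fi
status_output=$(git status --porcelain)

echo "CRs in work-tree: $worktree_cr, CRs in HEAD: $head_cr"
echo "git diff HEAD says: $diff_result"
echo "git status --porcelain prints: '$status_output'"
```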

`git add --renormalize`

The git add command takes similar short-cuts in some cases. This lets you do things like git add . without having Git re-compress and freeze-dry every file in your work-tree: it only re-compresses and freeze-dries files that, based on time stamps and such, look like they really need it. This of course works badly if you changed the cleaning setup, because files might need some real cleaning when Git thinks they're already clean.

The git add --renormalize operation tells Git: Defeat the special case code. Don't believe that index and work-tree are the same based on OS file time-stamps and such; really do the add, really applying the cleaning process. So that's one easy way around this problem, if and when it occurs. (I have seen reports here on StackOverflow of it not working, but never with a reproducer.)
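A minimal sketch of renormalizing, assuming Git 2.16 or later: a file is first committed with its CRLF endings intact, then a .gitattributes rule is introduced, and git add --renormalize re-cleans the already-tracked file. File names and identity values are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf false   # no conversion at first

# Commit a file with real CRLF endings, stored as-is.
printf 'one\r\ntwo\r\n' > notes.txt
git add notes.txt
git -c user.name='Example' -c user.email='example@example.com' commit -q -m 'CRLF as-is'
cr_before=$(git cat-file blob :notes.txt | tr -cd '\r' | wc -c | tr -d ' ')

# Change the cleaning rules, then force Git to re-apply them to tracked files.
echo '*.txt text eol=lf' > .gitattributes
git add --renormalize . 2>/dev/null
cr_after=$(git cat-file blob :notes.txt | tr -cd '\r' | wc -c | tr -d ' ')

echo "CRs in index before renormalize: $cr_before"
echo "CRs in index after renormalize:  $cr_after"
git status --short   # notes.txt now shows as staged for commit
```

Note that --renormalize only touches files that are already tracked, so the new .gitattributes file itself still has to be added separately.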

These aren't the only sources of problems

Note that it's possible to:

commit a file with actual CRLF line endings
later, instruct Git that it should extract and write such files with LF-only line endings
get into a state where, after extracting a file, it shouldn't be considered "clean"

and sometimes, depending on OS vagaries, this really does happen in spite of Git's attempts to be clever with file time-stamps and such.

More often, though, you will see a case where:

$ git clone <repo>
$ cd <the-clone>
$ git status

shows modified files when you're on a Windows or MacOS system, where you have a case-insensitive local file system, and you've just cloned a repository that was written on a Linux system with a case-sensitive file system.

The Linux user can make a commit that has two different files whose names differ only in case, e.g., README.MD and ReadMe.md. When your Git, on your Mac or Windows system, goes to extract these two different files to your work-tree, it creates one of them first—typically README.MD—and then goes to create the other one, ReadMe.md, but ends up overwriting the contents of README.MD with the contents from the committed (now index-copied) ReadMe.md.

What you see is a modified README.MD, with an unmodified ReadMe.md, because your work-tree has only the one file named README.MD with the contents from the committed ReadMe.md.

There are no good solutions to this problem other than to get your Linux colleagues to stop doing that. Git probably should have some fancy way to handle it, but it doesn't. It is possible to work your way through this without resorting to booting a Linux system, but bringing up a Linux VM is by far the easiest way to deal with it.
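One way to at least detect the problem is to look for index entries whose names differ only in case. The sketch below builds such a pair and then finds it; it assumes it is run on a case-sensitive file system (e.g., a typical Linux setup), since otherwise the two files cannot both be created in the first place.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Two distinct files whose names differ only in case (case-sensitive FS assumed).
echo 'upper' > README.MD
echo 'mixed' > ReadMe.md
git add README.MD ReadMe.md

# Lower-case every tracked path and report any name that occurs more than once.
collisions=$(git ls-files | tr '[:upper:]' '[:lower:]' | sort | uniq -d)

echo "case-colliding paths: $collisions"
```

Running only the git ls-files pipeline in an existing clone lists the paths that would overwrite each other on a case-insensitive system.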

Upvotes: 6

VonC

Reputation: 1323933

The working tree still has CRLF line endings as long as no reset (and deletion of index) takes place.

Not with core.autocrlf = input on: the checkout part (which fills your working tree) would have changed the eol to your system eol.
See this conversion table.

Reminder: never use core.autocrlf: it is a local configuration which applies to too many (i.e., all) files.
Use an eol directive in a .gitattributes file instead: it is a setting that is part of the cloned repository, which means you don't have to set anything locally.

Reminder: don't rm/add. With Git 2.16 (Q1 2018) or later, you have:

git add --renormalize .

Upvotes: 1
