Easiest way to update one file from LF to CR/LF on git pull?

Question

There are many questions concerning EOL conversion, but I can't find an answer for this particular situation: I have a readme.txt with Unix line endings. This text file is part of a repository which is deployed on users' machines and updated using a simple git pull.

We realised that this file should always be in CR/LF and would therefore like to change it from LF (other files are fine as they are). Updating .gitattributes with

readme.txt eol=crlf

works but only if the repository is cloned. If I want to update it, I have to do

git pull
git rm --cached readme.txt
git reset --hard

i.e. something I can't do on every user's machine. Is there a way out of this? Would an update to readme.txt help here?

torek · Accepted Answer

It's not at all clear to me why you care what appears in each user's work-tree. All that matters when using Git is what appears in each commit. Still, let's answer the question as asked:

Would an update to readme.txt help here?

Yes, it would. (The remainder of this answer is all optional reading, but probably a good idea.)

Why this is the case

The eol=crlf attribute tells Git that when the file is copied from the index to the user's work-tree, Git should find newline-only (LF) line endings in the frozen format copy and replace them with CRLF line endings in the user's work-tree.

That's not to say that what you stated is wrong, but what you stated isn't quite right either. :-) In fact, it's incomplete. To really understand this requires understanding how commits, the index, and the user's work-tree interact.

Commits

Remember that Git's most basic purpose—its reason for existence at all—is to store commits. Each commit contains a complete snapshot of every file. More precisely, a commit contains a full snapshot of every file that's in that commit. Put this way, it sounds redundant—but the idea is that this is the equivalent of an archive of those files as they existed at that time. Each commit could have a completely different set of files, but that's not typically how we use Git.

You could build such a thing naively out of an archiver like rar or tar or zip or whatever, by, every time you wanted to make a commit, just making a new full archive. Each such archive would be completely independent of every previous archive. That makes them easy to get back later. The drawbacks are that these would take a lot of space, and be easy to lose track of.

We first observe that each commit tends to re-use most of the files from the previous archive. What if, instead of making an independent archive, we made one that just re-used the previous one wherever possible? And in fact, Git does this.

To make this work and be fast, Git adds several more tricks. The main one is that every file's data—its content—is stored in a compressed, read-only, Git-only format that makes it very fast to see if Git already has a copy of that file. Because it is read-only—in fact, every part of every commit is read-only—it's quite safe to re-use an old copy of a file, based on looking up its content.

I like to call this read-only, Git-only, compressed format "freeze-dried". It makes it clear that you can't actually use this data until you first restore it to normal everyday format, "rehydrating" it. (Instant file: just add water!)
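The content lookup described above can be sketched in Python. This is a minimal illustration, assuming the classic SHA-1 object format, in which a blob's ID is the SHA-1 of the header "blob <size>\0" followed by the raw content. The helper name blob_hash is made up for this example and is not part of Git.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Return the ID a SHA-1 Git repository assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


lf_text = b"hello\n"
crlf_text = b"hello\r\n"

# Identical content always yields the identical ID, so a file that is
# unchanged between two commits only needs to be stored once.
print(blob_hash(lf_text))
print(blob_hash(lf_text) == blob_hash(b"hello\n"))

# Different line endings are different content, hence a different ID.
print(blob_hash(lf_text) == blob_hash(crlf_text))
```

In a SHA-1 repository, the first printed ID should match what git hash-object reports for a file holding the same bytes.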

The index and your work-tree

The committed copies of each file are all squirreled away in a database.¹ When you check out or switch to some commit, Git copies the files out of the database. This rehydrates them and makes them useful.

Git could stop here, with these two sets of entities: commits, and the work-tree. The commits are read-only and the work-tree is where you get work done. You'd build new commits from the work-tree. Other version control systems do just that ... but Git doesn't. Instead, Git inserts, between the current (or HEAD) commit and the work-tree copy, a third copy of each file.

This third copy—which is actually in the middle, between the other two, so maybe it's the second copy—is in the freeze-dried format, but unlike the copy inside a commit, you can change this copy. More precisely, you can replace it. This middle copy is stored in what Git calls, variously, the index or the staging area (or, rarely now, the cache).²

The index has multiple roles (perhaps the source of its multiple names), but its main one can be described as where you build the next commit you will make. Since it starts out matching the commit you checked out, it already has every file ready to go into a new commit. But suppose you change a work-tree file in some way. It doesn't matter how; it only matters that you have changed it. The changed version of this work-tree file isn't in the index yet.

You will have to run git add on the updated work-tree file. This copies the file back into the index, compressing it and turning it into the freeze-dried format. That boots the previous copy out of the index. Now the index contains the updated file, and the index is again ready to go into a new commit.

When you run git commit, Git collects the appropriate metadata (your name and email, log message, current commit hash ID, and so on) and makes a final frozen snapshot version of the files that are in its index. Since those files are already in the frozen format, this process is very fast, especially compared to other version control systems that don't have a pesky "index" in the way.
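The add-then-commit flow can be modeled in a few lines of Python. This is only a toy sketch: the ToyRepo class and its method names are invented for illustration, blobs are kept uncompressed in a dictionary, and the index is reduced to a path-to-blob-ID mapping (real index entries also carry a mode and cached file system data).

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Blob ID as computed for Git's SHA-1 object format."""
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


class ToyRepo:
    """Hypothetical model: an object store, an index, and a list of commits."""

    def __init__(self):
        self.objects = {}   # blob ID -> frozen content
        self.index = {}     # path -> blob ID (the proposed next commit)
        self.commits = []   # each commit is a snapshot of the index

    def add(self, path: str, worktree_content: bytes) -> None:
        blob_id = blob_hash(worktree_content)
        self.objects[blob_id] = worktree_content  # freeze the content
        self.index[path] = blob_id                # replace the old entry

    def commit(self) -> dict:
        snapshot = dict(self.index)  # files are already frozen: just copy IDs
        self.commits.append(snapshot)
        return snapshot


repo = ToyRepo()
repo.add("readme.txt", b"Hello\n")
repo.add("main.c", b"int main(void) { return 0; }\n")
first = repo.commit()

repo.add("readme.txt", b"Hello, world\n")  # only this entry changes
second = repo.commit()

print(first["main.c"] == second["main.c"])          # True: blob re-used
print(first["readme.txt"] == second["readme.txt"])  # False: new blob
```

The commit step does no hashing or compression at all in this model, which mirrors why building a commit from an already-populated index is cheap.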

When you extract a different commit by going to another branch or "going back in time" to a historic commit, Git has to update the index to match the commit, and update your work-tree to match the index. That means it has to copy each file from the index, to the work-tree, rehydrating it along the way. Likewise, as we just saw, git add has to copy a file from the work-tree, to the index, dehydrating / freeze-drying it along the way. This has several key implications for our crlf line endings, or more generally, for smudge and clean filters (which you also set up using .gitattributes).

¹This is Git's object database. The files' names are stored in what Git calls tree objects, with the content in blob objects, all tied together by Git's commit objects. This unifies the various pieces in one big content-addressable object system, which Git presents to you as a series of commits.

²Technically, the index contains not an actual copy of each file, but rather a mode (+x or -x rendered as 100755 or 100644), a file name (complete with embedded slashes: path/to/file.ext), and a blob hash. The blob hash is for the frozen, compressed file contents: the freeze-dried form of the file's data. When the data match those of any file in any existing commit, the blob hash is the same as that of the existing file in the existing commit.

As long as you don't get into the details of the index using git update-index or git ls-files --stage, though, you can just think of this as an extra copy, in the freeze-dried format. Everything else works out the same.

Filtering, including line endings

What if, during the "extract freeze-dried data" process, we had Git replace newline-only line endings with CRLF line endings? This is part of the "smudging" process: taking a clean file, stored in a commit and now in the index, and "dirtying it up" to put it into the work-tree, as a user-editable, user-usable file.

What if, during the "compress regular file down to freeze-dried format" process, we had Git replace CRLF line endings with newline-only line endings? This is part of the "cleaning" process: take a dirty file, stored in the user's work area, and "clean it" to put it into the index, ready to be committed.

This is what the eol= settings do. They do not, and can not, change any existing committed files. Those are already inside commits and are frozen for all time.
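The two directions can be sketched as a pair of Python functions. This is a simplified model, assuming a plain text file with eol=crlf in effect: the names clean and smudge are borrowed from the filter terminology above, and the sketch ignores details of real Git such as binary detection and safety checks on mixed line endings.

```python
def clean(worktree_bytes: bytes) -> bytes:
    """Work-tree -> index direction: turn CRLF line endings into LF."""
    return worktree_bytes.replace(b"\r\n", b"\n")


def smudge(index_bytes: bytes) -> bytes:
    """Index -> work-tree direction: turn LF line endings into CRLF."""
    # Normalize first so an existing CRLF does not become CR CR LF.
    return clean(index_bytes).replace(b"\n", b"\r\n")


committed = b"line one\nline two\n"

# What lands in the work-tree when the file is checked out.
worktree = smudge(committed)
print(worktree)

# What git add would store: the round trip gives back the committed bytes,
# so the commit itself never contains CRLF.
print(clean(worktree) == committed)
```

Neither function touches the committed bytes themselves. Conversion only happens while data moves between the index and the work-tree.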

This is also where your issue comes from.

Optimization

When you switch from some commit a123456... to some different commit b789abc..., Git could:

1. remove every file that is in the index from the index and work-tree
2. re-populate the entire index and work-tree from the new commit

and that would get you the commit you wanted checked out. But that would be extremely slow and have annoying side effects on the time stamps on every file.

Because of the way Git stores the files in commits, however, it's really easy for Git to tell whether some file named path/to/file.ext or whatever, that is in the index right now because of commit a123456..., needs to be different, or removed entirely, because of what's in b789abc... for path/to/file.ext.

If the file doesn't have to be different, Git just leaves it alone, in both the index and the work-tree. If the file does have to be different, Git won't let you switch from the current commit, a123456..., to this other commit b789abc... unless the index and work-tree copies of the file are "clean", i.e., match the current commit. (There are a lot of tricky corner cases here. See much more at "Checkout another branch when there are uncommitted changes on the current branch".)

This means that it's important whether all three copies (HEAD commit, index, and work-tree) match or not. The introduction of filters and end-of-line conversions makes the word match tricky, though. In some cases, Git will look at saved file system time stamp data, cached in the index,³ to decide if the file is "clean".

The true "clean-ness" of files depends in part on what kind of EOL conversion, if any, you have chosen. However, changing the .gitattributes file (or changing the smudge and clean filters) is not something Git actually notices, so if you change EOL settings, Git can think a file is "clean" when it's not, or vice versa.

In your particular case, you've added a new setting to .gitattributes that says: when the file gets copied from index to work-tree, change LF to CRLF; when the file gets copied from work-tree to index, change CRLF to LF. So if Git noticed, it would check these things ... but Git doesn't notice.

Suppose a user has the existing repository checked out at commit H1 (for some hash), the tip of, say, master, and runs git pull. His Git (I'm assuming the user is male) contacts the other Git over at origin and fetches new commits. That brings over a commit whose hash is H2 (some other hash) that is the tip of origin's master. His Git then runs git merge on the hash ID H2 to combine any work/commits he has done with this other work.

Assuming he has not done any work since H1, and H2 has H1 as its parent commit, his Git does a fast-forward operation instead of a merge, which amounts to doing a git checkout of commit H2 that drags his branch name master forward to point to commit H2. So now Git employs that optimization. The file .gitattributes has a different blob hash, so his index and work-tree copies of .gitattributes must be replaced. Since Git believes (correctly) that these are clean, they are replaced. His Git's index copy of readme.txt, however, has the same blob hash as the readme.txt in new commit H2. So his Git doesn't touch his index or work-tree copy of readme.txt.

The result is what you see: the work-tree copy continues to have whatever line endings it had before.

If the two commits H1 and H2 have different content for file readme.txt (note that this means different cleaned content), then his Git's fast-forward operation will see that his copies of readme.txt, in his Git's index and his work-tree, do need to be replaced. As long as his Git thinks they are "clean", his Git will replace them. This means copying the committed readme.txt into the index, and then copying the index copy to his work-tree: this copying will obey the new eol=crlf action and will replace newline-only "clean frozen file" data with CRLF-ending work-tree data.
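The difference between the two cases (unchanged versus changed readme.txt content) can be simulated with a small Python model. It is a sketch under loud assumptions: commits are plain dictionaries from path to cleaned content, the set crlf_paths stands in for the eol=crlf entries in .gitattributes, file removal is ignored, and fast_forward is a made-up name for the optimized checkout described above.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Blob ID as computed for Git's SHA-1 object format."""
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


def smudge(content: bytes, crlf: bool) -> bytes:
    """Index -> work-tree conversion; LF becomes CRLF only when requested."""
    if not crlf:
        return content
    return content.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def fast_forward(old_commit, new_commit, worktree, crlf_paths):
    """Update worktree in place, rewriting only paths whose blob hash changed."""
    rewritten = []
    for path, content in new_commit.items():
        old = old_commit.get(path)
        if old is not None and blob_hash(old) == blob_hash(content):
            continue  # same blob: index and work-tree copies are left alone
        worktree[path] = smudge(content, path in crlf_paths)
        rewritten.append(path)
    return rewritten


h1 = {"readme.txt": b"Hello\n"}
worktree = {"readme.txt": b"Hello\n"}

# H2 only adds .gitattributes; readme.txt has the same cleaned content.
h2 = {**h1, ".gitattributes": b"readme.txt eol=crlf\n"}
print(fast_forward(h1, h2, worktree, {"readme.txt"}))
print(worktree["readme.txt"])  # still LF: the file was never rewritten

# H3 changes the cleaned content of readme.txt, so it is rewritten.
h3 = {**h2, "readme.txt": b"Hello\nUpdated.\n"}
print(fast_forward(h2, h3, worktree, {"readme.txt"}))
print(worktree["readme.txt"])  # now CRLF
```

In this model, committing any real change to readme.txt alongside the .gitattributes update is what forces the work-tree copy to be re-extracted with the new line endings.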

If the user subsequently edits his work-tree readme.txt, he—or his editor, at least—will see these CRLF endings. What his editor does with them is up to his editor. (I force my editor to show them to me, and then I strip them out because I don't like them and I don't care that you want me to have them. :-) ) If he updates the file and runs git add, his git add will strip away those CRLF endings, replacing them with newline-only endings, the way files should be; that's what will go into the index, and hence what will be in the next commit.

³Hence the rarely-used name cache for the index. In modern Git, the term cache mostly refers to the in-memory copy of the index, though, as loaded from the index file and then operated on by whatever Git command you're running.