Arakis

Reputation: 889

Does Git internally store Line Endings?

Does Git internally store specific Line Endings?

My guess is, Git is storing only the line data itself (line ending neutral), and depending on the operating system, it will use the platform specific line endings.

Or in other words: Can line endings ever be stored "mixed" within the same file?

I'm not talking about settings within the .gitattributes file.

Upvotes: 2

Views: 982

Answers (2)

torek

Reputation: 488133

As iBug said, Git just stores a raw data block. Linus originally didn't have any sort of CRLF line ending hackery in Git: the initial release in 2005 did not do it at all. The first line-ending conversion code was introduced into Git in Feb 2007, in commit 6c510bee2013022fbce52f4b0ec0cc593fc0cc48. The .gitattributes file itself was introduced a bit later, in April 2007, in commit d0bfd026a8241d544c339944976927b388d61a5e.

The real key to understanding these, though, is to make note of the difference between the index copy of each file, and the working tree copy of each file. Remember that the index holds, in effect, the proposed next commit, or at least its snapshot. (The metadata for the next commit is generated at the time you run git commit or whatever other command it is that makes it.) The contents of any existing commit are sacrosanct: nothing, not even Git itself, can change them at all.

The extraction side

When you first check out some commit (with, e.g., git checkout branch or git switch branch, or the same with a raw hash ID, though git switch demands the --detach flag for this case), Git will fill in Git's index from that commit, and fill in your working tree. (Any previous commit in Git's index and your working tree is first removed from both places, with some fancy caveats outlined in "Checkout another branch when there are uncommitted changes on the current branch".)

The index gets exactly what's in the commit. That means that if a committed file has a weird mix of LF-only lines and CRLF lines, or is a binary file like a JPG with random sprinklings of binary data that a naive program would think are line endings, well, that's what goes into the index too. More precisely, the "copy" of a file that's in the index is really just the raw hash ID of some blob in Git's database. The blob object that holds some existing, committed file is read-only, and hence easily shared. So to do the initial checkout, Git just lets the index share that copy. The blob hash ID goes into the index, stored in the slot-zero entry under the file name listed in the commit.[1] This blob is stored in Git's object database, and is either compressed, in the form of a loose object, or very compressed, in the form of a packed object. Git can read either one; nobody and nothing can write over one, although Git can make new (different) loose objects pretty easily.

The working tree copy, however, is a different story. Git must decompress the blob object. This means reading the compressed blob bytes and running the zlib decompression code over that, to get different bytes representing the file contents as you would like to see them. Because Git is already doing this work, this is an ideal place for Git to do a bit more work: Git can replace LF-only line endings with CRLF line endings.[2]
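To make the decompression step concrete, here is a minimal Python sketch of the loose-object layout: a zlib-deflated byte string made of a "blob <size>" header, a NUL byte, and then the raw file content. The function names encode_loose_blob and decode_loose_blob are invented for this illustration; this is not Git's own code, and it ignores packed objects entirely.

```python
import zlib


def encode_loose_blob(content: bytes) -> bytes:
    """Build the bytes of a loose blob: zlib(header + NUL + content)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)


def decode_loose_blob(raw: bytes) -> bytes:
    """Inflate a loose object and return the content that follows the header."""
    inflated = zlib.decompress(raw)
    header, _, content = inflated.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if kind != b"blob" or int(size) != len(content):
        raise ValueError("not a well-formed blob object")
    return content


# Mixed line endings are just bytes, so they survive the round trip unchanged.
mixed = b"unix line\nwindows line\r\n"
assert decode_loose_blob(encode_loose_blob(mixed)) == mixed
```

Nothing in this layout knows about lines at all, which is why any line-ending handling has to happen as a separate step after the bytes are inflated.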

So, as Git extracts an index copy into your working tree, Git can turn LF-only lines into CRLF lines. If some file is marked (via .gitattributes or any other way) as needing this conversion, Git does it; if the file has an LF that's not preceded by a CR first, Git ensures that what it will write to your working tree file has a CR first, then that LF.
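The extraction-side rule described above can be sketched in a few lines of Python. This only models the behavior as stated in this answer (put a CR in front of any LF that does not already have one); the name lf_to_crlf is made up here, and Git's real C code has extra checks this sketch does not attempt to reproduce.

```python
import re


def lf_to_crlf(data: bytes) -> bytes:
    """Insert a CR before every LF that is not already preceded by a CR."""
    # The negative lookbehind leaves existing CRLF pairs untouched.
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


blob_bytes = b"one\ntwo\r\nthree\n"
worktree_bytes = lf_to_crlf(blob_bytes)
assert worktree_bytes == b"one\r\ntwo\r\nthree\r\n"
```

Note that the lines that were already CRLF in the blob come out unchanged, so a mixed blob becomes a consistently-CRLF working tree file under this rule.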

This is the git checkout side of things.[3] Let's pause a moment for footnotes and then look at the git add side of things.


[1] Technically, the commit lists the tree hash ID for the tree object that represents the snapshot. That tree object contains name component pieces and hash IDs. The hash IDs may be those of sub-trees, containing more name-component-pieces, or of blobs that represent files that should be checked out. Well, the entry might be a symlink or a gitlink (submodule), but those are relatively rare: the common tree entries are for subtrees and for file blobs.

[2] This conversion (LF-only to CRLF) is the only line-ending conversion built in to Git's file extraction code. You can add your own arbitrary additional conversions using Git's smudge filters, but those are up to you to write. (Note that Git-LFS uses a pre-written smudge filter that extracts "large" files from a separate storage area elsewhere on the web. This is an add-on to Git, not part of Git itself.)

[3] Until the new git restore command went into Git 2.23, only index-to-worktree extraction did this kind of conversion. Now that git restore can extract a file straight from a commit to your working tree, there's one more place that can do it. Note that git checkout -- path/to/file or git checkout commit -- path/to/file writes first to Git's index, and only then to your working tree, so this particular code path goes through the index-to-worktree functions. That's why this new git restore feature is worth a footnote: until Git 2.23, you had to have Git scribble on Git's index first; now you can avoid that.


The git add side

When you run git add—including the implied add from a git commit -a, for instance—Git does the actual hard work at this time, rather than waiting for a later git commit. If you're using git commit -a, that commit is going to happen within milliseconds, probably, but the logic is still the same: first, Git does a git add.

The point of git add is to update Git's index. We must update the index—the proposed next commit (or its snapshot)—first. Only once the proposed commit matches what we want to commit, can we commit that.

Since the index holds path names and blob hash IDs (and file modes), Git must, at this time, turn your working tree file into a blob. To do that, Git has to start out by producing that blob as a new loose object—or at least, figuring out what its hash ID would be, if it did this. It turns out that the quickest way to figure out what the hash ID would be is to go ahead and begin writing the object, compressing—with zlib's compressor—while computing the hash. Since we don't know the object's name (hash ID) yet, we just use a temporary name: .git/objects/tmpXXXXXXX with the Xes filled in with something unique, for instance. (The precise temporary name doesn't matter here.)
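As a rough illustration of "compress while computing the hash", here is a Python sketch that feeds the same stream of bytes to both SHA-1 and a zlib compressor. It assumes the classic SHA-1 object format, where the hashed data is "blob <size>", a NUL byte, and then the content. The function name hash_and_compress and the chunk size are arbitrary choices for this example, not anything taken from Git's source.

```python
import hashlib
import zlib


def hash_and_compress(content: bytes, chunk_size: int = 8192):
    """Return (hash ID, compressed object bytes) for a blob, in one pass."""
    hasher = hashlib.sha1()
    compressor = zlib.compressobj()
    compressed = bytearray()

    header = b"blob %d\0" % len(content)
    chunks = [content[i:i + chunk_size] for i in range(0, len(content), chunk_size)]
    for piece in [header, *chunks]:
        hasher.update(piece)                       # same bytes go to the hash...
        compressed += compressor.compress(piece)   # ...and to the compressor
    compressed += compressor.flush()
    return hasher.hexdigest(), bytes(compressed)


lf_id, _ = hash_and_compress(b"hello world\n")
crlf_id, _ = hash_and_compress(b"hello world\r\n")
print(lf_id)             # same ID that `git hash-object` reports for this content
print(lf_id != crlf_id)  # a different line ending means a different blob
```

Because the hash covers every byte, a CRLF copy and an LF-only copy of "the same" file are two unrelated blobs as far as the object database is concerned.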

To feed data to the compression and hash functions, however, Git has to read the working tree copy of the file. If the working tree copy is marked (via .gitattributes or whatever other mechanism) as needing conversion, well, this is a perfect time to check for CR followed by LF, and drop the CR part so that we get LF-only line endings. This way, both the hashing function and the zlib compressor will get LF-only lines.[4]
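The add-side rule is the mirror image, and a minimal Python sketch of it is even shorter. As before, this only models the rule as described here (drop a CR that is immediately followed by an LF); crlf_to_lf is a name invented for the example.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Drop each CR that is immediately followed by an LF."""
    return data.replace(b"\r\n", b"\n")


worktree_bytes = b"one\r\ntwo\nlone CR stays\rthree\r\n"
cleaned = crlf_to_lf(worktree_bytes)
assert cleaned == b"one\ntwo\nlone CR stays\rthree\n"
```

A bare CR that is not followed by an LF is left alone, so old-Mac-style endings are not touched by this rule.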

Once the entire file is fed through the hashing function and compressor, Git now has the right blob hash ID. Git checks to see if the object already exists as a hashed object. If so, Git just re-uses that existing object, deleting the temporary file.[5] Otherwise, Git renames the temporary file to make it a valid loose object in .git/objects/. Now that there's an object for that file, that's what goes into Git's index. Note that at this point, the entire file has been converted to LF-only line endings, regardless of what any previous commit had.
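Here is a hedged Python sketch of the "write to a temporary name, then either discard or rename" step. It assumes the usual loose-object layout, where the first two hex digits of the hash ID name a subdirectory and the rest name the file. The function write_blob, the tmp_obj_ prefix, and the throwaway objects directory are all hypothetical; this does not touch a real repository.

```python
import hashlib
import os
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def write_blob(objects_dir: Path, content: bytes) -> Tuple[str, bool]:
    """Store content as a loose blob; return (hash ID, whether a new file was created)."""
    store = b"blob %d\0" % len(content) + content
    hash_id = hashlib.sha1(store).hexdigest()

    # Write under a temporary name first, since the final name depends on the hash.
    fd, tmp_name = tempfile.mkstemp(prefix="tmp_obj_", dir=str(objects_dir))
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(zlib.compress(store))

    final = objects_dir / hash_id[:2] / hash_id[2:]
    if final.exists():
        os.unlink(tmp_name)          # object already present: reuse it
        return hash_id, False
    final.parent.mkdir(exist_ok=True)
    os.replace(tmp_name, final)      # rename the temp file into place
    return hash_id, True


with tempfile.TemporaryDirectory() as scratch:
    objects = Path(scratch)
    first = write_blob(objects, b"hello world\n")
    second = write_blob(objects, b"hello world\n")
    print(first[1], second[1])  # the second call reuses the existing object
```

Adding identical content twice yields the same hash ID and only one stored object, which is the sharing behavior the answer relies on.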


[4] As before, this "delete a CR that is immediately followed by an LF" filter is the only one built into Git here, in these code paths. You can do arbitrary filtering yourself, with a clean filter, but you must write that yourself. (Unsurprisingly, if you've read and understood footnote 2, this too is provided automatically as an add-on by Git-LFS: if a file is "big" and is to be stored elsewhere, Git-LFS's "clean" filter stores the file somewhere else, and produces the LFS data, so that a later checkout can retrieve the file, as the "cleaned" data.)

[5] Git should probably check (perhaps optionally, since this could be slow) that this is not the result of a hash collision. I don't think it currently does so. The chance of an accidental hash collision is tiny enough to be negligible, but given the existence proof of breaking SHA-1, it seems to me that an optional check would be a good idea.


This all only happens when it's enabled

For Git to make changes to a file, the file has to be marked as "should be modified". Setting core.autocrlf to true can mark some files: Git will now attempt to guess whether some file is text or binary. Listing files in .gitattributes can mark some files, as specifically text, or specifically binary, and also mark them for specific conversions.

Again, though, the only built in conversions are:

  • extract: turn LF-only to CRLF
  • add: turn CRLF to LF-only

Some settings enable both conversions, and some enable only the add-side conversion (the old crlf=input, in old versions of Git, and eol=lf in modern Git).

Note that since blob-to-working-tree extraction never removes a CR-before-LF, existing committed files that have CRLF endings in their internal blob form (either consistently, or mixed) are always checked out as files that have CRLF endings in the working tree. If you do not touch a checked-out file that is in this state, Git notices that you have not touched it, and does not add the file, so it continues to have the mixed or consistently-CRLF endings in the next commit.

The git add --renormalize flag is intended to force Git to re-add files even if they appear to be untouched. This way they get run through the CRLF-to-LF-only conversion, if that's set up.
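To see why a forced re-add changes anything, here is a small Python simulation of the two built-in conversions applied to a blob that was committed with mixed endings. It assumes conversion is enabled for the file and models the rules only as described in this answer; checkout_convert and add_convert are illustrative names, and no Git command is actually run.

```python
import re


def checkout_convert(blob: bytes) -> bytes:
    """Extract side: add a CR before any LF that lacks one."""
    return re.sub(rb"(?<!\r)\n", b"\r\n", blob)


def add_convert(worktree: bytes) -> bytes:
    """Add side: drop a CR that is immediately followed by an LF."""
    return worktree.replace(b"\r\n", b"\n")


committed = b"first\r\nsecond\n"        # mixed endings stored in an old commit
worktree = checkout_convert(committed)  # CRs are never removed on checkout
renormalized = add_convert(worktree)    # what a forced re-add would store

assert worktree == b"first\r\nsecond\r\n"
assert renormalized == b"first\nsecond\n"
assert renormalized != committed        # so the re-added blob is a new object
```

As long as the file is never re-added, the old mixed blob simply carries forward; once it is pushed through the add-side conversion, the stored content becomes LF-only.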

Upvotes: 5

iBug

Reputation: 37227

Git stores whole files as "blobs" and only does conversion when reading (checking out) or writing (indexing). Git does not store "neutralized lines".

So yes, Git can save files with mixed line endings, but that's usually bad practice and best avoided.
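If you want to check whether a particular file's bytes contain mixed endings, a quick Python sketch like the following can count them. The helper names count_line_endings and has_mixed_line_endings are made up for this example, and it works on raw bytes you supply rather than reading anything from a Git repository.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, LF-only endings, and bare CRs in a byte string."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf   # LFs that are not part of a CRLF pair
    cr_only = data.count(b"\r") - crlf   # CRs that are not part of a CRLF pair
    return {"crlf": crlf, "lf": lf_only, "cr": cr_only}


def has_mixed_line_endings(data: bytes) -> bool:
    """True if more than one kind of line ending appears."""
    return sum(1 for n in count_line_endings(data).values() if n) > 1


sample = b"unix\nwindows\r\nunix again\n"
print(count_line_endings(sample))
print(has_mixed_line_endings(sample))
```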

Upvotes: 8
