Andy An

Reputation: 49

What does "filter=lfs diff=lfs merge=lfs" do in .gitattributes?

I saw code here
https://gist.github.com/Srfigie/77b5c15bc5eb61733a74d34d10b3ed87

#Image
*.jpg filter=lfs diff=lfs merge=lfs -text
*.jpeg filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.gif filter=lfs diff=lfs merge=lfs -text
*.psd filter=lfs diff=lfs merge=lfs -text
*.ai filter=lfs diff=lfs merge=lfs -text
*.tif filter=lfs diff=lfs merge=lfs -text

Upvotes: 2

Views: 4633

Answers (1)

torek

Reputation: 488519

(Side note: this isn't really a good topic fit for StackOverflow. I'm going to answer it anyway.)

Filtering, as a general purpose thing, is described in the gitattributes documentation:

A filter attribute can be set to a string value that names a filter driver specified in the configuration.

A filter driver consists of a clean command and a smudge command, either of which can be left unspecified. Upon checkout, when the smudge command is specified, the command is fed the blob object from its standard input, and its standard output is used to update the worktree file. Similarly, the clean command is used to convert the contents of worktree file upon checkin. [snip]

There's similar (not identical) verbiage for diff and merge attributes; we'll leave that out for explanatory purposes. Meanwhile, the -text tells Git that all such files are "non-text" files, which is certainly true for image files like JPG and GIF files.

So:

filter=lfs

links Git operations to "smudge and clean filters". These are defined elsewhere, in .git/config or $HOME/.gitconfig or similar. The Git-LFS system, which "wraps" Git with these various filters to add on a service that Git itself is completely unaware of, installs all of this stuff for you.
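To make the anatomy of one of those .gitattributes lines concrete, here is a minimal Python sketch that splits a line into its pattern and its attributes. The function name parse_gitattributes_line is made up for illustration, and the parsing is deliberately simplified: it ignores quoted patterns and macro definitions, which the real format allows.

```python
def parse_gitattributes_line(line):
    """Split one .gitattributes line into (pattern, {attribute: state}).

    Simplified sketch: quoted patterns and [attr] macros are not handled.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None  # blank line or comment such as "#Image"
    pattern, *tokens = stripped.split()
    attrs = {}
    for token in tokens:
        if token.startswith("-"):
            attrs[token[1:]] = False        # unset, e.g. -text
        elif token.startswith("!"):
            attrs[token[1:]] = None         # explicitly unspecified
        elif "=" in token:
            name, value = token.split("=", 1)
            attrs[name] = value             # set to a string, e.g. filter=lfs
        else:
            attrs[token] = True             # plain "set"
    return pattern, attrs


pattern, attrs = parse_gitattributes_line(
    "*.jpg filter=lfs diff=lfs merge=lfs -text"
)
print(pattern, attrs)
```

So the line says: for every path matching *.jpg, use the drivers named "lfs" for filtering, diffing, and merging, and treat the file as non-text.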

Your question appears to be: What the heck does all this mean? How does this work at the Git level? You don't actually need to know any of this to use Git-LFS, but it's still a good idea to learn it, since if and when things go wrong, you may need to perform a bit of surgery, and it will be wise to understand where the heart and lungs and other important bits are before you start slicing up the patient. 😬

It's now time to dive into how Git stores files. This is something every Git user should know, at least at a superficial level, because it affects a lot of Git usage.

How Git stores files

To a first approximation, Git isn't really about files, but rather about commits. But each commit stores files. In fact, every commit stores a full snapshot of every file. It's as if each commit were an archive: a tar or zip or WinRAR or whatever, holding every file.

One could do this totally naively, by having git commit actually store every file as an archive. But this would be massively wasteful of space, since most commits, in most repositories, mostly hold duplicates of previous commits' files. That is, with a typical work pattern, we check out some commit (obtaining a full snapshot of perhaps a thousand or ten thousand files) and then we modify just a few of these and make a new commit: a new snapshot, containing the same 10,000 files or whatever, except for the two or three we changed. If we actually re-saved all the files every time, the repository would rapidly grow very fat. Git would be bloated and slow and unusable. [1]

To avoid this, git commit doesn't just store a simple archive of every file every time. Instead, Git stores each file as an internal blob object, using what computer scientists refer to as content-addressable storage. In this system, Git reduces each file's data to a hash ID (formally, an object ID or OID). The commit then stores, indirectly, a path name like path/to/file.ext and the hash ID for the content. This relies on each distinct content getting a unique hash ID. Guaranteed uniqueness is mathematically impossible, [2] but it works in practice. [3]

There's a bunch more theory behind all of this, but the end result is simple enough: Git de-duplicates content. The stored content in the repository is literally shared whenever two commits have two files that have the same content. This is true even if the two commits use different names for the file: it's the content, not the name, that matters. Renaming a file in Git doesn't change the stored file content, so its hash ID remains the same, and there's still just the one copy of the file.
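Here is a small Python sketch of that idea, assuming a repository that uses the default SHA-1 object format. The helper name git_blob_id and the two toy "snapshots" are made up for illustration. The ID is computed from a short header plus the content, and the file name never enters into it.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the blob ID Git assigns to this content (SHA-1 repositories)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two toy snapshots holding the same content under different names.
snapshot_a = {"README.txt": b"hello world\n"}
snapshot_b = {"docs/renamed.txt": b"hello world\n"}

id_a = git_blob_id(snapshot_a["README.txt"])
id_b = git_blob_id(snapshot_b["docs/renamed.txt"])
print(id_a)
print(id_a == id_b)  # same content, same ID: only one stored copy is needed
```

If you have Git handy, the first printed ID should match what git hash-object reports for a file containing the same bytes.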

This works well for reducing the overall size of the repository: when we extract a commit with 10,000 files, change just one file, and make a new commit, the new commit literally re-uses 9999 files and stores the one new (uniquely-modified) file. By de-duplicating identical content, most "normal" files get shared across commits. Even binary files benefit from this. But that's just the start.

This kind of sharing is pretty good: it reduces repository bloat by a huge factor. Many previous version control systems (prior to Git) attempted the same thing in a different way: each commit in those systems stored, not the new files, but the changes made. If a file had no changes to it, that took no space to store.

This idea, of storing just the changes, is more generally called delta compression or delta encoding. It not only stores duplicates as "no space used" (because there's nothing to change), it stores near-duplicates in very small amounts of space.
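To give a feel for delta encoding, here is a toy Python sketch. It is not Git's actual pack delta format or algorithm: it borrows difflib from the standard library to find matching regions, and the names make_delta and apply_delta are made up for illustration. The delta is a list of "copy this range from the base" and "insert these literal bytes" instructions.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # reuse bytes from base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # new literal bytes
        # "delete" needs no instruction: we simply do not copy that range
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base plus the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n" * 20
target = base.replace(b"two", b"2", 1)          # a near-duplicate of base

delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(target), "bytes rebuilt from", literal_bytes, "literal bytes")
```

A near-duplicate needs only a handful of literal bytes. Everything else is a reference back into the base.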

Git doesn't use delta-encoding at the commit level, but does use delta-encoding. Git just sneaks it in later, well after you've run git commit. Git does this with some commands that you, as a user, should never have to know about: git pack-objects and git repack. These take the underlying objects (not just the internal blob objects that hold file data, but also the tree and commit and tag objects that Git uses for the rest of its work) and do a lot of clever delta-encoding to shrink the objects down even further. Git hides most of this behind git gc, which you can run on your own, since git gc is meant to be somewhat less user-hostile than other internal Git commands. You still don't need to, though, as Git is supposed to automatically run git gc on its own whenever that will be profitable.

Delta-encoding, as performed by Git and other version control systems, relies in part on a general concept known as Shannon entropy or sometimes file entropy (see How to calculate the entropy of a file? and entropy (information theory)). Text files, of the sort used by computer programmers to write computer programs, generally have extremely low entropy and can therefore be compressed, often extremely effectively, this way.

Git also uses zlib compression for objects; this uses the same entropic principles, but in a totally different way. This method is less computationally intensive as it works on a single file (looking for low-entropy data within the file, as opposed to low-entropy information across multiple files). It's not as effective on text files as the delta compression, but because it's cheaper computationally, Git does this up front, at the time each object is stored into Git's object database.

In any case the key observation for these compression tricks is this: It works great for text files, but is terrible on already-compressed binaries like JPG images. While the inputs are complicated, the end result is simple to understand: Git's storage works great for computer program source code and badly for JPG images and stuff like that.
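You can see the effect with a few lines of Python. This is only a rough sketch. The repeated C line stands in for source code, and random bytes stand in for the payload of an already-compressed image. That stand-in is an assumption on my part, since real JPG files are not literally random, but they behave similarly under zlib.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


# Stand-in for program source: highly repetitive, low-entropy text.
source_like = b"int main(void) { return 0; }\n" * 2000

# Stand-in for an already-compressed binary: high-entropy bytes.
image_like = os.urandom(len(source_like))

text_ratio = compression_ratio(source_like)
binary_ratio = compression_ratio(image_like)
print(f"text-like data:   {text_ratio:.3f}")
print(f"binary-like data: {binary_ratio:.3f}")
```

The text-like input shrinks to a tiny fraction of its size, while the binary-like input does not shrink at all.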

To this, we add one more observation: not only are these binary files hard to compress (so that Git does a lousy job of it), they also tend to be large. Your entire app, whatever it is, might be N megabytes, or N gigabytes, or however big it is for some number N ... and half or more of that is these big binary non-compressible files!


[1] Many might complain today that Git is still pretty unusable, but it's definitely not bloated or slow, especially compared with most of the alternatives. 😀 Its usability is, um, impacted, to use management-speak, by its relative user-unfriendliness. It might be nice sometimes if Git didn't force users to climb a steep learning curve.

[2] See the pigeonhole principle.

[3] "In theory, there's no difference between theory and practice, but in practice there is." It is possible to generate a hash collision: see How does the newly found SHA-1 collision affect Git? In practice, though, this does not seem to have occurred yet, not even for the file that the Google researchers made, due to the header Git shoves in front of the data.


How Git-LFS stores "large" files

Using the above, we observe that if we took all the "big, non-compressible" files out of Git entirely, we'd get a nice small repository that would fit within storage limits on common hosting services like GitHub. So, let's come up with a way to do that.

Of course, we'd like to store those files somewhere. That "somewhere" just has to be "not Git", whatever it is. Let's write a new separate set of server programs, independent of the ones that Git uses on GitHub, and install those on a separate set of servers that we'll call "Git-LFS servers". These servers will be optimized to store large files without attempting to compress them or do anything fancy. They'll just store the file content, and we don't really care how, as long as we can get the content back by some sort of unique name we present to the server later.

Now we'd like to hook this extra, side server up to Git somehow, so that:

git switch somebranch

extracts all the files including the large files that are not stored in Git at all. And now we come to the special trick involving the .gitattributes file. Now it's time to go back to how Git works internally.

How Git makes and extracts commits

As you already know, you use git switch or git checkout to check out a commit. This gets you all the saved files from the archive that Git made at the time you, or whoever, ran git commit to make the saved archive.

To save a lot of time and effort, Git records two things about each file it extracts from this commit archive, at the time you extract that archive using git checkout or git switch. As we saw above, the internal format for a file's data is its "blob hash". So Git stores—in something that Git calls its index or staging area—the blob hash and file name for each extracted file.

You can see these in git ls-files --stage output:

$ git ls-files --stage
100644 4860bebd32f8d3f34c2382f097ac50c0b972d3a0 0       .cirrus.yml
100644 c592dda681fecfaa6bf64fb3f539eafaf4123ed8 0       .clang-format
100644 f9d819623d832113014dd5d5366e8ee44ac9666a 0       .editorconfig
100644 b0044cf272fec9b987e99c600d6a95bc357261c3 0       .gitattributes
100644 c8755e38de81caf60768c0309b5348f03a120fc1 0       .github/CONTRIBUTING.md
[snip]
100644 d182756827fe5128292798b707a52aed25e7aa48 0       branch.c
100644 ef56103c050fa09d6087e2bade7f24240d79ae04 0       branch.h
100644 8901a34d6bf424680b9d13a1bdf332bedb4d8e20 0       builtin.h
100644 76277df326b4f47f594e4580f6f645ffa76455f3 0       builtin/add.c
100644 30c9b3a9cd72588fc2fb4495faedcc7cf3eda258 0       builtin/am.c
100644 58ff977a2314e2878ee0c7d3bcd9874b71bfdeef 0       builtin/annotate.c
100644 555219de40fa7e3097612a60eb953f81580a8de9 0       builtin/apply.c
[snip]
100644 9e36f24875d20711b61d243994f324d00a1b211e 0       xdiff/xutils.c
100644 fd0bba94e8b4d2442ba59d0a4327d2d53e10210a 0       xdiff/xutils.h
100644 d594cba3fc9d82d94b9277e886f2bee265e552f6 0       zlib.c

This is the result of running git ls-files --stage on a clone of the Git repository for Git itself. We can see file names and hash IDs (and more, such as the mode 100644 parts; the staging area has data I'm skipping over here as it's not really relevant to the .gitattributes stuff).
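If it helps to think of the index as a lookup table, here is a small Python sketch that turns lines like the ones above into a path-to-blob-hash mapping. The function name parse_stage_line is made up for illustration. Real git ls-files --stage output puts a tab before the path, so the sketch accepts either a tab or the runs of spaces shown here.

```python
def parse_stage_line(line: str) -> dict:
    """Parse one 'mode hash stage<sep>path' line from git ls-files --stage."""
    line = line.rstrip("\n")
    if "\t" in line:
        # Real output: the fields, then a tab, then the path.
        meta, path = line.split("\t", 1)
        mode, blob_id, stage = meta.split()
    else:
        # Copy-pasted output where the tab became spaces.
        mode, blob_id, stage, path = line.split(None, 3)
    return {"mode": mode, "blob": blob_id, "stage": int(stage), "path": path}


sample = [
    "100644 b0044cf272fec9b987e99c600d6a95bc357261c3 0       .gitattributes",
    "100644 fd0bba94e8b4d2442ba59d0a4327d2d53e10210a 0\txdiff/xutils.h",
]

# The part of the index that matters here: path name -> blob hash ID.
index = {entry["path"]: entry["blob"] for entry in map(parse_stage_line, sample)}
print(index["xdiff/xutils.h"])
```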

In order to make usable data show up in your working tree, Git had to:

  • turn the file's path name, such as xdiff/xutils.h, into a blob hash ID (here fd0bba94e8b4d2442ba59d0a4327d2d53e10210a);
  • retrieve the zlib and possibly delta compressed data for fd0bba94e8b4d2442ba59d0a4327d2d53e10210a and expand it into ordinary text; and
  • store that into a file named xutils.h in a folder (directory) named xdiff as required by your OS.

The latter might even be xdiff\xutils.h if you're on Windows: Git stores a forward slash regardless of how your OS requires you to spell folders-and-files. Git handles all the folder-izing and stuff automatically here, even though Git doesn't store folders, just files with long names with embedded (forward) slashes.

Later, if and when you run git commit, Git simply takes everything that's in this index or staging area—including the blob hash ID and the file's path name—and saves all that stuff into a new commit. If you've modified the file, you had to run:

git add xdiff/xutils.h

before you ran git commit. This git add step stored a new hash ID into the index, storing a new object if needed. Git did its zlib and/or any other compression it chose to do at the time you ran git add, so that everything was ready to go into the new commit.
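Both directions can be sketched in Python. This is a toy model under several assumptions: it handles only loose, zlib-compressed objects (no packs, no deltas), the index is a plain dict, and the names add_blob and checkout_file are made up for illustration. The directory layout (first two hex digits as a folder name) is how Git lays out loose objects.

```python
import hashlib
import os
import tempfile
import zlib


def add_blob(objects_dir: str, content: bytes) -> str:
    """The 'git add' side: hash, compress, and store one blob; return its ID."""
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    oid = hashlib.sha1(raw).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return oid


def checkout_file(objects_dir: str, index: dict, worktree: str, name: str) -> str:
    """The checkout side: path -> blob ID -> expanded data -> real file."""
    oid = index[name]                                  # look up the blob ID
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as f:
        raw = zlib.decompress(f.read())                # expand the stored data
    _header, _, content = raw.partition(b"\0")         # drop the "blob N" header
    target = os.path.join(worktree, *name.split("/"))  # OS-style folders
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(content)
    return target


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = os.path.join(tmp, "objects")
    worktree = os.path.join(tmp, "worktree")
    index = {"xdiff/xutils.h": add_blob(objects_dir, b"/* toy content */\n")}
    written = checkout_file(objects_dir, index, worktree, "xdiff/xutils.h")
    with open(written, "rb") as f:
        restored = f.read()
    print(index, restored)
```

Note where the two conversion points sit: one in add_blob, one in checkout_file. Those are exactly the places where a filter gets a chance to step in.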

Again, this is a lot of mechanism, and we have to distill it all down to what this means for us, in our goal of storing "big" files not in Git while storing the little files "in Git" and being able to get them all back. So let's think about this a bit:

  • The actual file contents are in blobs that are encoded at git add time and decoded at checkout.
  • The file name is in the commit (indirectly, but close enough).

So: What happens if we insert our own tricks while git add and git checkout / git switch are running? In our tricky little step, we'll trick Git: we'll have Git store a unique key, and we'll store the real file elsewhere: in Git-LFS.

When the user runs git add, we'll read the actual file, from the working tree, and save it away somewhere: perhaps on the LFS server right away, perhaps a bit later, but somewhere, we'll store the big file. Then we'll lie to Git and tell it that there is no big file, that there's a small file that consists of the key data we'll need to get the big file later.

When the user runs git switch or git checkout, we'll read the small file we had Git store. We'll use that to reach out to the LFS server and retrieve the big file. We'll then lie to Git and tell it that we wrote the small file to the user's working tree, while actually writing the big file to the user's working tree.

By cleverly lying to Git, we get Git to store the small file, which we call a pointer file. We use the pointer file's name to know where to put the real contents; we use the pointer file's data to retrieve the real contents.

The filter allows us to do the strategic lying

Git runs the smudge filter whenever it extracts a file from the repository. The original concept behind a smudge filter was that it might do something like, say, turn \n (LF-only line endings) into \r\n (CRLF line endings) for Windows, or expand keywords the way RCS/CVS did, or whatever. But we can use it to replace the pointer file data with the real data, from the large file that Git itself has never seen.

Git runs the clean filter whenever we use git add to copy a file back into Git. The original concept here was, again, that we could turn CRLF to LF-only, or replace an expanded CVS-style keyword with the un-expanded keyword, or whatever. But we can use it to replace the entire large file with pointer-file data, so that Git never sees the big file at all.

And that's how Git-LFS works and what the filter line is doing. Git never sees the big file, just the small pointer file.
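Here is the trick in miniature, as a Python sketch. The pointer layout (version, oid, size lines) follows the Git-LFS pointer file format. Everything else is an assumption for illustration: LFS_STORE is an in-memory dict standing in for the LFS server, and the clean and smudge functions take and return bytes, whereas the real filter commands read standard input and write standard output.

```python
import hashlib

# Hypothetical stand-in for the Git-LFS server: key -> big file content.
LFS_STORE = {}

POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


def clean(big_file: bytes) -> bytes:
    """git add side: stash the big file elsewhere, hand Git a tiny pointer."""
    oid = hashlib.sha256(big_file).hexdigest()
    LFS_STORE[oid] = big_file
    pointer = (
        f"version {POINTER_VERSION}\n"
        f"oid sha256:{oid}\n"
        f"size {len(big_file)}\n"
    )
    return pointer.encode("utf-8")


def smudge(pointer: bytes) -> bytes:
    """Checkout side: read the pointer Git stored, return the real content."""
    lines = pointer.decode("utf-8").splitlines()
    fields = dict(line.split(" ", 1) for line in lines)
    algorithm, oid = fields["oid"].split(":", 1)
    content = LFS_STORE[oid]
    # Sanity checks: the pointer must describe what we got back.
    assert algorithm == "sha256" and len(content) == int(fields["size"])
    return content


big_image = bytes(range(256)) * 4096          # about 1 MiB of fake image data
pointer = clean(big_image)                    # what Git actually stores
print(pointer.decode("utf-8"))
print(len(big_image), "bytes in the working tree,", len(pointer), "bytes in Git")
assert smudge(pointer) == big_image           # checkout gets the big file back
```

Git hashes, compresses, and commits only the pointer bytes. The megabyte of image data lives entirely outside the repository.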

What about the diff and merge lines?

The filter above covers the check-out step (the working tree gets the big file instead of the pointer file) and the git add step (the repository gets the pointer file instead of the big file). But these are not the only places we have to sneak in a substitute file.

In particular, git diff needs to compare files as they appear in commits. But the commits store the pointer files now, instead of the real files. We need a way to sneak into the diff operation and lie to Git here, too. Adding a diff driver to the .gitattributes file does exactly this. Writing the diff driver is a bit complicated, but the principle is simple.

Similarly, git merge needs to run multiple git diff operations. Defining a merge driver is even harder than defining a diff driver, but once again, the principle is simple and obvious: if we're going to merge at all, we want to merge the real files, not the pointer files, and we need to lie to Git to make that happen.

So all of this stuff is about lying to Git, cleverly and carefully, so that Git never sees the "big" files, and yet the user—the person running git diff or git show, git switch, git add, git commit, and so on—can work with the repository as if the big files were stored in each commit the same way the little files are actually stored in each commit. They're not, and clever lies to Git only make it seem that they are. Git has no idea that we're lying to it, and Git-LFS must lie to it correctly and consistently to make this all work. If it breaks, you must understand where the lies are happening.

Upvotes: 12
