How to filter a pull of a repo by file type

Question

Is it possible to pull a git repo but include only some file types? For example: Can I pull only the .py files of a repo?

I already tried --filter and --sparse but didn't have a lot of success...

My goal is to run a command that will copy/clone/pull all files that end with .py to my hard drive. The source for the files is a Git repo.

torek · Accepted Answer

As noted in comments, Git doesn't deal in terms of files but rather in terms of commits. When you use git clone, you get a new Git repository. Into that repository, your Git copies every commit from some other, existing repository. So you get all of their commits.

Interestingly, git clone does not copy any of their branches. Git turns each of their branch names into a remote-tracking name in your own repository. Then Git creates one branch in your repository, using one of their branches to set up all the details. So it looks like you got one of their branches—but in fact, you just got your own branch with the same name as one of their branches.
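To make the renaming concrete, here is a minimal sketch of what the default fetch refspec set up by git clone (+refs/heads/*:refs/remotes/origin/*) does to each of their branch names. The function name and the sample branch names are made up for illustration; Git does this internally, not through any such helper.

```python
from typing import Optional

def to_remote_tracking(ref: str, remote: str = "origin") -> Optional[str]:
    """Map a branch ref in *their* repository to the remote-tracking ref in yours.

    Mirrors the default fetch refspec: +refs/heads/*:refs/remotes/<remote>/*
    (to_remote_tracking is a hypothetical helper, not part of Git.)
    """
    prefix = "refs/heads/"
    if not ref.startswith(prefix):
        return None  # not a branch (for example a tag), so this refspec does not apply
    return f"refs/remotes/{remote}/{ref[len(prefix):]}"

# Hypothetical branch names in the repository being cloned.
their_branches = ["refs/heads/main", "refs/heads/feature/x"]
for ref in their_branches:
    print(ref, "->", to_remote_tracking(ref))
```

None of these results is a branch in your repository. The one branch you end up with is created afterwards, from one of the remote-tracking names.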

There are several things you need to know about "filtering" (which is a tricky term with multiple different meanings to different people):

1. While every commit has a full copy of every file, these files are stored in a compressed, Git-only, read-only, de-duplicated format. So even if hugefile.py exists in a million commits, if there are only seven different versions of hugefile.py, there are only seven copies of hugefile.py (each slightly different). The million commits all share.
2. The files that are inside commits are completely unusable by you. They are in a Git-only form. Git won't show you these files, since they're in that useless-to-you compressed form. Instead, when you git checkout some commit, or use git switch (new in Git 2.23) for the same purpose,¹ Git copies the files out into a usable form.

   The files you see and work with are these copied-out ("checked-out") files. They live in an area we call your working tree.
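The de-duplication is easier to see with a small model. Git names each stored file ("blob") by a hash of its content, so identical content in different commits is stored once. The sketch below assumes a SHA-1 repository (the default); the commit labels and file contents are made up, and the dictionary is only a stand-in for Git's real object database.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git gives a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical history: three commits, but only two distinct versions of the file.
versions_by_commit = {
    "commit-1": b"print('v1')\n",
    "commit-2": b"print('v1')\n",  # unchanged, so it shares the stored blob
    "commit-3": b"print('v2')\n",
}

# Toy object store keyed by blob ID: identical content collapses to one entry.
object_store = {}
for commit, content in versions_by_commit.items():
    object_store[git_blob_id(content)] = content

print(len(versions_by_commit), "commits,", len(object_store), "stored blobs")
```

On a real repository, git hash-object computes the same ID for a file's content.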

When you talk about filtering, these are the things that come to mind:

1. Git offers the ability to make a shallow clone. A shallow clone is one in which we deliberately omit some older commits. Most things still just work, and you can deepen the shallow clone as needed.
2. Git offers the ability to make a single-branch clone. This often has a lot less benefit than a shallow clone, but by default, if you use the shallow option, you also get the single-branch option.

   These two together skip the most commits, so if your goal is to save space by only getting the last two or three commits on one branch, this default shallow-and-single-branch mode is what you want. In many repositories, though, the savings are minimal, because the compression that Git does is really good. It only really tends to break down with large binary files, so unless the repository you're cloning has a lot of them, don't expect to save much.
3. There is a new feature, not very well supported yet, called a partial clone. The idea here is that if you have an always-on (or almost-always-on) network, the fact that Git copies everything to your local machine is perhaps less important to you. This kind of clone doesn't bother copying things to your machine until they're actually needed.

   Because it's so new, the tooling here is difficult to use. Don't try to do this unless you really need to or want to get very deep into Git.
4. There is an old feature that is getting new life today, called sparse checkout. With sparse checkout, your working tree doesn't get filled up with every file from the commit. You still have every file; you just don't see them. Instead, your working tree only gets some subset of the files.

   This saves almost no space at all except when the working tree is bigger than the Git repository. Its main use is to de-clutter your work area. However, it has the potential for huge space and time savings when combined with the partial clone trick. That's why it is getting new life today. Unfortunately the combination isn't ready for general use.
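For reference, the options above correspond roughly to the command lines built in this sketch. It is an illustration, not a recommendation: the URL and directory names are placeholders, the helper functions are hypothetical, and the sparse-checkout step assumes a Git recent enough to accept --no-cone (pattern mode). Given the caveats above, treat the partial-plus-sparse combination as experimental. The script only prints the commands; nothing is run unless you call run_all yourself.

```python
import subprocess
from typing import List

def shallow_clone_cmd(url: str, dest: str, depth: int = 1) -> List[str]:
    """Shallow clone; --depth also implies --single-branch by default."""
    return ["git", "clone", "--depth", str(depth), url, dest]

def partial_sparse_cmds(url: str, dest: str, patterns: List[str]) -> List[List[str]]:
    """Partial clone (file contents fetched on demand) plus a pattern-based sparse checkout."""
    return [
        # --filter=blob:none defers downloading file contents until they are needed.
        ["git", "clone", "--filter=blob:none", "--sparse", url, dest],
        # Non-cone mode takes gitignore-style patterns such as "*.py".
        ["git", "-C", dest, "sparse-checkout", "set", "--no-cone", *patterns],
    ]

def run_all(cmds: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)

# Placeholder URL and directory names, for illustration only.
url = "https://example.com/some/repo.git"
print(" ".join(shallow_clone_cmd(url, "repo-shallow", depth=3)))
for cmd in partial_sparse_cmds(url, "repo-py-only", ["*.py"]):
    print(" ".join(cmd))
```

Even if the second variant works for you, what you get is still a Git repository holding commits. Only the working tree is limited to the files matching the patterns.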

There's one other thing that comes to mind, but you should almost certainly avoid it: the word "filter" is reminiscent of git filter-branch and git filter-repo. These let you take some existing commit(s) and copy them to new and supposedly improved commits. When you do this, your disk usage gets larger, initially, because you have the old commits plus the new ones. Never do this to save space! Instead, do this if and only if the old commits are somehow terrible, or polluted with toxic stuff, or whatever. When you're done filtering, you have an all-new repository, with all-new commits: you get everyone to throw out their old repositories entirely and switch to your shiny new one. (Or, if they think those old "polluted" commits are still the bee's knees, 23 skidoo, 1920s here we come, they ignore you entirely. 😀)

TL;DR: this kind of "filtering" is probably a bad idea. Make sure you really want to do that before you start.

¹The git checkout command has too many modes of operation and too many sharp edges. It's like a power saw with a faulty guard. So the Git folks made a better, less-powerful tool, git switch, that won't slice your thumb off by mistake. Because Git users are grouchy old guys who hate change 😀, the existing git checkout still exists and still works the same way. Use whichever you prefer. (The old checkout has been improved so that it doesn't slice your thumb off quite as often any more, too. But if you like to use the safer tool set, remember that the more-dangerous mode of git checkout is now in the newfangled git restore; the safe part is copied into git switch.)
