jessehouwing

Reputation: 115017

Gitignore subfolder, except files with a specific name

I've got a large repo in which I'm downloading a large number of vsix files (Visual Studio and Azure DevOps extensions), then extracting them and running a number of tools over them.

I'd like to keep some of the files in the extracted folder in my git repo, but ignore most of the rest (it's 30GB).

My folder structure looks like this:

\vsixs
   \publisher
       \extension
           \1.2.3
               \results-code.json
               \extension.vsixmanifest
               \extension.vsomanifest
               \taskname
                   \task.json
                   \node_modules
                   \index.js
            \...
         \...
      \...
somescripts.ps1
.gitignore

I've tried a number of things in my .gitignore

vsixs/
!vsixs/**
!result-code.json
!extension.*manifest
!task.json

And a number of other permutations... I've also read a number of other answers, but have yet to stumble on anything that works.

I think that, due to performance optimizations, the pattern I specify needs to be oddly specific, but I can't figure out how specific...

Upvotes: 1

Views: 229

Answers (1)

torek

Reputation: 489708

Summary

The magic you want is:

!*/

Remember that it's somewhat expensive; use it wisely. (Combine it with !/* to un-ignore everything in the root of your working tree.)

Long

As always, the issue here is a mismatch between Git's storage format (as seen in commits), and your OS's storage format. Git handles only files—never folders—but the file names include forward slashes, e.g., somescripts.ps1 and vsixs/publisher/extension/1.2.3/results-code.json. Your OS, meanwhile, insists that there's no such file name as vsixs/publisher/extension/1.2.3/results-code.json. Instead, there's a folder named vsixs; inside that folder is a sub-folder named publisher; and so on. Eventually we get to a file named results-code.json (or some other name).

Git must bridge this gap for you:

  • Git does so during git checkout or git switch quite easily, by creating folders on demand: if Git needs to check out a file named vsixs/publisher/extension/1.2.3/results-code.json, it goes ahead and creates vsixs, then vsixs\publisher, and so on, as needed, until your OS is willing to create a file named results-code.json in the appropriate folder.

  • Git does so during git commit by having Git's index contain a file named vsixs/publisher/extension/1.2.3/results-code.json. This file goes into Git's index (aka "staging area") at the time you check out the commit, which is the same time Git created all the folders (if they did not already exist).

  • Git does so during git add with ... well, how? This is where the problem occurs. It's git add that updates the copies of files that exist in Git's index, which is fine for files that already exist there because they were copied out of some earlier commit. But for new files, it's not so fine.

So: If a file named vsixs/publisher/extension/1.2.3/results-code.json is already in Git's index, git add -u or some other en-masse git add will generally find and update it, since Git already knows to look for it. But if it's not in Git's index, Git has to search for it, and this searching is very slow on many systems. This kind of searching literally requires that Git open every folder everywhere and read the folder's content:

  • the folder named vsixs has, as its content, sub-folders and/or files;
  • the sub-folder publisher inside vsixs has, as its content, more sub-folders and/or files;
  • and so on.

For each folder, Git must laboriously open it (e.g., open vsixs), read all its entries, and for each entry get both the short name (publisher) and the constructed full name (vsixs/publisher; note the forward slash here). If the entry is a file, Git adds it if that's appropriate. If it's a folder, Git opens and reads it, recursively, to find more files and folders, and so on.

To speed this process up, if you give Git permission to "ignore" a folder during this scanning process, Git does so. So if Git is allowed to "ignore" vsixs, it simply does not open and read it and therefore never discovers publisher, much less anything in publisher.
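To see this short-circuit in action, here is a minimal sketch, assuming git and a POSIX shell are available. The throwaway temporary repository and the trimmed-down two-line .gitignore are illustrative assumptions, not the asker's exact file:

```shell
#!/bin/sh
set -eu
# Throwaway repository in a temporary directory (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# A small slice of the folder layout from the question.
mkdir -p vsixs/publisher/extension/1.2.3/taskname
touch vsixs/publisher/extension/1.2.3/taskname/task.json

# Ignore the folder, then try to rescue one file name inside it.
printf '%s\n' 'vsixs/' '!task.json' > .gitignore

# List every untracked, non-ignored file individually.
out=$(git status --porcelain --untracked-files=all)
printf '%s\n' "$out"
```

Only .gitignore should be listed. Because vsixs itself is ignored, Git never opens it, so the !task.json rule is never tested against anything inside.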

The solution is now clear(ish): tell Git not to ignore some or all folders

If Git must look inside some folder in your working tree, Git cannot ignore that folder. Git can ignore some or all of the files in that folder, but it must open and read the folder itself. So don't give Git permission to skip that folder.

If you know the precise name of any particular folder, you can list that as a "do not ignore" exception:

!/vsixs/

for instance, in the top level .gitignore in your working tree, says that Git must open and read vsixs. (Should there be a vsixs/vsixs/, that's a different name: the leading slash anchors the pattern to the level of the .gitignore file, so this does not force Git to un-ignore that sub-vsixs.) Or:

!vsixs/

works similarly, except that it "un-ignores" any folder named vsixs. There's something odd going on here with the two slashes, which we'll get to in a moment. For now, just remember that this only works on entries named vsixs. If we want all folders, we need *, or rather, */:

!*/

As always, the leading ! means that this is a specific "do not ignore" entry.

The slashes

What's the difference between a, /a, a/, a/b, a/b/, /a/b, and so on? There are two obvious differences and one less-obvious one:

  • some start with slash;
  • some end with slash; and
  • some contain a slash that's not at the front or the back.

For .gitignore rules, the starts with and contains in the middle both have the same effect: they both "anchor" this ignore rule. The ends with has a different, separate meaning: it means only match a folder.

An un-anchored ignore rule—one that doesn't start with slash, nor (after removing any trailing slash that means "folder") contain a slash—has Git look only at the short name component. That is, if Git is reading vsixs and comes across publisher, Git uses the short part, publisher, as the name for the un-anchored tests. But it uses the longer, constructed, vsixs/publisher for the anchored tests. (Again, Git always uses a forward slash at this point.)

We'll mostly gloss over the trick Git uses for nested .gitignore files, which is that these constructed names start at the same level as the .gitignore file that was the source of the rule. For a top-level .gitignore file, this doesn't skip anything, so it's just the full path.

As noted in the parenthetical remark above, the trailing slash means match if and only if the actual thing found, by reading the OS's folder in the working tree, is itself a folder. So we'll make use of that to force Git to read folders without also forcing Git to un-ignore files.
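The three kinds of slash can be compared side by side. The following is a rough sketch, assuming git and a POSIX shell; the names out, tmp, cache, docs/draft and sub are made up purely for illustration:

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)   # throwaway repository, illustrative only
cd "$repo"
git init -q .

mkdir -p out sub/out cache docs sub/docs
touch out/x.txt sub/out/x.txt     # "out" at the root and one level down
touch tmp sub/tmp                 # files named "tmp" at two levels
touch cache/x.txt sub/cache       # "cache" as a folder, then as a plain file
touch docs/draft sub/docs/draft   # "docs/draft" at the root and one level down

# One rule per slash style discussed above.
cat > .gitignore <<'EOF'
/out
tmp
cache/
docs/draft
EOF

# List every untracked, non-ignored file individually.
out=$(git status --porcelain --untracked-files=all)
printf '%s\n' "$out"
```

The listing should contain .gitignore, sub/cache, sub/docs/draft and sub/out/x.txt. The anchored rules (/out and docs/draft) match only at the top level, the un-anchored tmp matches at every level, and cache/ matches the folder named cache but not the plain file sub/cache.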

The bottom line

If we know the name of a folder we want searched, we list it:

!vsixs/

for example. This allows Git to ignore other folders.

If we don't know the name of the folder, we have to use a pattern. If we know nothing at all about the name, we must use:

!*/

which tells Git: if it's a folder, open and read it.

Since these are un-anchored—they neither start with, nor contain in the middle, a slash—they apply to all folders here and at any deeper level. That makes this last one expensive. If you can be sure that they should only apply at this level, you can put the .gitignore file here and use:

!/*/

to reduce its cost. Since vsixs is a known name, if you know its level, you can write:

!path/to/vsixs/

or:

!/vsixs/

and it's now an anchored rule and applies only to the one specific vsixs, which in theory makes it even cheaper (though if there's hardly anything named vsixs, it's pretty cheap to start with, so this might be negligible anyway).

Hence, to ignore all files, then un-ignore specific files and all directories, the rule would read:

*
!*/
!.gitignore
!file-to-keep

These un-ignore rules could go in any order relative to one another, as long as they come after the * line, since Git applies all rules to each entry it finds as it reads through folders. The last matching rule wins. So as Git scans through the top level, it finds:

  • .gitignore. This does match *, doesn't match */, does match .gitignore, and doesn't match file-to-keep. The last matching rule is negated (has a leading !) so .gitignore is not "ignored": it's considered for git add-ing if untracked, and not complained-about as an untracked file.

  • vsixs. This does match * and then */, but does not match either other rule. The last matching rule says do not ignore so Git opens and reads it recursively.
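Applied to the layout in the question, the recipe might look like the sketch below (assuming git and a POSIX shell, in a throwaway repository). The file names come from the question, and !/* is the root-level un-ignore mentioned in the summary. The final node_modules/ line is an added assumption: it re-ignores those folders so Git does not descend into them, and should be dropped if their contents matter.

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)   # throwaway repository, illustrative only
cd "$repo"
git init -q .

base=vsixs/publisher/extension/1.2.3
mkdir -p "$base/taskname/node_modules/dep"
touch somescripts.ps1 \
      "$base/results-code.json" \
      "$base/extension.vsixmanifest" \
      "$base/extension.vsomanifest" \
      "$base/taskname/task.json" \
      "$base/taskname/index.js" \
      "$base/taskname/node_modules/dep/task.json"

# Ignore everything, un-ignore all folders and the root entries,
# then un-ignore the specific file names to keep.
cat > .gitignore <<'EOF'
*
!*/
!/*
!results-code.json
!extension.*manifest
!task.json
node_modules/
EOF

# List every untracked, non-ignored file individually.
out=$(git status --porcelain --untracked-files=all)
printf '%s\n' "$out"
```

The listing should show .gitignore, somescripts.ps1, results-code.json, the two manifest files and taskname/task.json, while index.js and everything under node_modules stay ignored.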

Note that you can put a .gitignore file down as far as needed, provided Git opens and reads that folder. The one folder Git is guaranteed to open-and-read is the top level of the working tree. After that point, the ignore rules kick in.
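As a sketch of that nested placement (again assuming git and a POSIX shell, with a throwaway repository), the same rules can live in vsixs/.gitignore. Nothing at the top level is then affected, so no root-level un-ignore rule is needed:

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)   # throwaway repository, illustrative only
cd "$repo"
git init -q .

base=vsixs/publisher/extension/1.2.3
mkdir -p "$base/taskname"
touch somescripts.ps1 \
      "$base/results-code.json" \
      "$base/taskname/task.json" \
      "$base/taskname/index.js"

# These rules apply only to vsixs and below, because the file lives there.
cat > vsixs/.gitignore <<'EOF'
*
!*/
!.gitignore
!results-code.json
!extension.*manifest
!task.json
EOF

# List every untracked, non-ignored file individually.
out=$(git status --porcelain --untracked-files=all)
printf '%s\n' "$out"
```

Here somescripts.ps1 should be listed without any extra rule, together with vsixs/.gitignore, results-code.json and taskname/task.json, while index.js remains ignored.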

Upvotes: 2
