user14067569

Initialize "empty" directory

File/Directory Structure:

main/home/script.py
main/home/a/.init
main/home/b/.init

I want to set up my .gitignore to exclude everything in the home directory but include specific file types.

What I tried:

home/*      #exclude everything in the home directory and subdirectories
!home/*.py  #include python files immediately in the home directory
!**.init    #include .init files in all directories and subdirectories.

The problem: I can't seem to get the .init files included. The purpose of these files is to ensure that Git creates all my directories, even if they do not contain any files yet. To that end, I want to place an empty 0-byte .init file inside each directory so that the "empty" directory is committed by Git.

Thanks.

Upvotes: 0

Views: 250

Answers (2)

torek

Reputation: 489083

If you want to create, e.g., home/foo/.init and have this file be put into Git's index (for more about the index, see below), you will need to tell Git not to cut off searches into the home/*/ directories:

!home/*/

Then, as Fady Adal noted (but I've adjusted slightly), you probably also want:

!**/.init

so that when Git searches home/*/ it will find and de-ignore files named .init. Note that this de-ignores all .init files; perhaps you want:

!home/**/.init

here so that you can ignore a file named, e.g., nothome/foo/.init. (You might even ignore home/**/* while un-ignoring home/**/*/ and home/**/.init.)
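
One way to check these patterns is to try them in a scratch repository. The sketch below is only an illustration: it assumes git is on your PATH, it uses the stricter !home/**/.init variant, and the function name tracked_after_add and the extra file home/notes.txt are hypothetical additions of mine, not part of the question.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# The question's two original lines plus the two suggested here.
GITIGNORE = """\
home/*
!home/*/
!home/*.py
!home/**/.init
"""


def tracked_after_add():
    """Build the question's layout in a scratch repo; return what `git add .` stages."""
    if shutil.which("git") is None:
        return None  # git is not installed, so there is nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # home/notes.txt is a hypothetical file that should stay ignored.
        for rel in ("home/script.py", "home/notes.txt", "home/a/.init", "home/b/.init"):
            path = root / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()  # empty, 0-byte file
        (root / ".gitignore").write_text(GITIGNORE)
        for args in (["init", "--quiet"], ["add", "."]):
            subprocess.run(["git", *args], cwd=root, check=True, capture_output=True)
        listing = subprocess.run(["git", "ls-files"], cwd=root, check=True,
                                 capture_output=True, text=True)
        return listing.stdout.split()


if __name__ == "__main__":
    print(tracked_after_add())
```

With Git installed, this should list .gitignore, home/a/.init, home/b/.init and home/script.py, and leave home/notes.txt out.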

Long: what's going on here?

I like to say that Git stores only files, not directories, and this is true—but the reason it's true has to do with the way Git builds new commits, which is, from Git's index.

Commits and the index

Each commit stores a full and complete copy of every file that Git knows about. This full-and-complete copy, however, is stored in a special, read-only, Git-only, frozen-for-all-time format in which duplicate files are automatically de-duplicated. That way, the fact that your first commit has (say) a README.md file that hardly ever changes, means that every commit just shares that README.md file. If it does change, the new commits begin sharing the new file. If it changes back, new commits after that go back to sharing the original file. So if there are only three versions of README.md, despite having 3 million commits, those 3 million commits all share the three versions of the file.
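
The de-duplication falls out of how Git names file contents. As a rough sketch, assuming the default SHA-1 object format (the helper name blob_id and the sample script contents are mine), the ID of a file depends only on its bytes, so every empty .init file ends up as the very same stored object:

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Object ID Git gives to file content (SHA-1 object format assumed)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two empty .init files in different directories, plus one real file.
files = {
    "home/a/.init": b"",
    "home/b/.init": b"",
    "home/script.py": b"print('hi')\n",
}
ids = {path: blob_id(data) for path, data in files.items()}
unique_blobs = set(ids.values())

if __name__ == "__main__":
    for path, object_id in ids.items():
        print(object_id, path)
    print(len(files), "files,", len(unique_blobs), "stored blobs")
```

Three paths, but only two distinct blobs: the two .init files share one.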

But note that these files are literally read-only. You can't change them. Not even Git can change them (for technical reasons having to do with hash IDs; this is true of all existing commits too). They're not in a format that most of your computer programs can use, either. That means that to work on the file, or even just to look at it, Git has to expand out the frozen-and-compressed, Git-only committed file, into an ordinary everyday form.

This means that when you pick some commit to work on it, Git has to extract all the files from that commit. So there are already two copies of each file: the frozen one, in the Git-only compressed-and-de-duplicated form, and the useful one, in your work-tree.

Most version control systems (VCSes) have this same pattern: there's a committed copy of each file, in some VCS-specific form, saved inside the VCS, and there's a plain-text / ordinary-format version that you work on. Many VCSes stop here, with just the two active files (and one of them might be stored in some central repository, rather than on your computer; Git stores the VCS copy on your computer).

To make a new commit, then, the VCS obviously has to package up all of your work-tree (ordinary-format) files. Some version control systems literally do this. Most at least stick a cache in here to make this go faster, because doing things this way is painfully slow. Git, however, uses a sneaky trick.

In Git, there's a third copy of each active file. This third copy goes in what Git calls, variously, the index, the staging area, or—rarely these days—the cache. Technically this one isn't usually a copy as Git stores it in the internal, compressed-and-de-duplicated form, so it's really just a reference to a blob-hash-ID. This also means that it's ready to go into the next commit.

What this means is that the index—or staging area, if you prefer that term—can be described as holding the next commit you intend to make. The index takes on an expanded role during conflicted merges, so this is not a complete description, but it's good enough for thinking about it. When you use git commit to make a new commit, Git just packages up all the prepared, frozen-format, pre-de-duplicated files from the index. But the index holds only files—files with long names, like home/a/.init for instance, but files, not directories.

Checking out some commit, to work on it, means extracting the files from that commit. Git puts them—in their frozen format, but now change-able—into the index so that they're ready to make a new commit, and de-compresses them into ordinary format in your work-tree so that you can see and work on them. Then, when you use git add, you are telling Git: Make the index copy of some file match the work-tree copy of that file.

  • If there's already an index copy, the index copy gets booted out (it's probably safely in some commit though) and Git de-duplicates the work-tree copy into the appropriate compressed, frozen-format copy and puts that in the index instead.

  • If there wasn't an index copy, now there is. (It's also still de-duplicated: if you make a new file that has some old file's content, the old content from the old commit gets re-used.)

Either way, it's now ready to go into a new commit.
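
If it helps, here is a toy model of that three-copy arrangement. It is a deliberately simplified sketch, not how Git is implemented: the work-tree and index are plain dictionaries keyed by path, and the function names git_add and git_commit are stand-ins of mine for the real commands.

```python
from types import MappingProxyType


def git_add(index: dict, worktree: dict, path: str) -> None:
    """Make the index entry for `path` match the work-tree copy."""
    index[path] = worktree[path]


def git_commit(index: dict):
    """Snapshot whatever the index holds right now; read-only afterwards."""
    return MappingProxyType(dict(index))


worktree = {"home/script.py": b"print('v1')\n", "home/a/.init": b""}
index = {}

git_add(index, worktree, "home/script.py")
first = git_commit(index)

worktree["home/script.py"] = b"print('v2')\n"  # edit the work-tree copy only
second = git_commit(index)                     # index untouched: same content as first

git_add(index, worktree, "home/script.py")
git_add(index, worktree, "home/a/.init")
third = git_commit(index)

if __name__ == "__main__":
    print(sorted(third))  # only file paths; there is no entry for "home/a" itself
```

Note that the keys are always full file paths. A directory with no files in it has nothing that could be added, which is why the question needs a placeholder file at all.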

This is where .gitignore comes in

The .gitignore files are somewhat misnamed. They do not literally make Git ignore a file. A file's presence, or absence, in new commits you make is determined strictly by whether the file was in the index at the time you ran git commit.

What .gitignore does instead is two-fold. First, when you use git status, Git will complain about files that exist in your work-tree, but are not in Git's index. This complaint comes in the form of telling you that some file is untracked. This is literally what untracked means: that there is a file in your work-tree, where you can see it and edit it and so on, that isn't in Git's index right now. That's all it means, since you can put a file into Git's index (git add) or take one out (git rm or git rm --cached) at any time. But because the index is the source of each new commit, it's important to know whether some file is in the index or not—which is why Git complains if it's not.

Sometimes, though, this complaint is just annoying: Yes, I know this compiled object-code file is not in the index. Don't tell me! I know already and it's not important! So, to keep Git from complaining, you list the file in another file that probably should be called .git-do-not-complain-about-these-untracked-files.

But that's not the only thing that you get by listing the file in .gitignore. It not only shuts up git status, it also makes git add not actually add the file. So git add * or git add . won't add the object-code file, or whatever. So to keep Git from adding, you list the file in a file that perhaps should be called .git-do-not-auto-add-these-files.

Hence .gitignore might be called .git-do-not-complain-about-these-untracked-files-and-do-not-automatically-add-them-either. But once those files are in the index, a .gitignore entry has no effect, so maybe it should be .git-do-not-complain-about-these-untracked-files-and-do-not-automatically-add-them-either-but-if-they-are-in-the-index-go-ahead-and-commit-them. But that's just ridiculous, so .gitignore it is.
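
Those two effects, and the lack of any effect on files already in the index, can be sketched with sets of path names. This is a simplified model under my own assumptions: the file names are hypothetical, the helper names are mine, and Python's fnmatch stands in for Git's real pattern matching.

```python
from fnmatch import fnmatchcase


def untracked(worktree: set, index: set, ignore_globs: list) -> list:
    """Files `git status` would complain about: in the work-tree, not in the index, not ignored."""
    return sorted(path for path in worktree - index
                  if not any(fnmatchcase(path, glob) for glob in ignore_globs))


def add_all(worktree: set, index: set, ignore_globs: list) -> set:
    """Rough stand-in for `git add .`: stage untracked files unless they are ignored."""
    return index | set(untracked(worktree, index, ignore_globs))


worktree = {"main.c", "main.o", "old.o"}
index = {"main.c", "old.o"}  # old.o got into the index before the ignore rule existed

if __name__ == "__main__":
    print(untracked(worktree, index, []))             # main.o is reported
    print(untracked(worktree, index, ["*.o"]))        # nothing is reported
    print(sorted(add_all(worktree, index, ["*.o"])))  # main.o not added, old.o still there
```

The ignore rule hides main.o and keeps it from being added, but old.o stays in the index and so goes into the next commit anyway.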

Scanning directories is slow

When you have a massive Git repository, with millions[1] of files in it, some of the things that Git normally does very quickly start really bogging down. Even at just a few hundred thousand files, some things can be slow. One of the slowest is scanning through a directory (or folder) to look for untracked files.[2]

By listing a directory, such as home/a/, in a .gitignore file, you give Git permission to take a shortcut. Normally, Git would say to itself: Ah, here is a directory home/a. I must open it and read out every file in it, and see if those files are in the index or not, in order to decide whether these files are untracked and/or need to be added. But if the entire directory is to be ignored, Git can stop a little short: Wait! I see home/a is to be ignored! I can skip it entirely! And so it goes on to home/b/ instead of looking inside home/a/.

To make sure that Git doesn't skip a directory, you must make sure that it's not ignored. This is where trailing slashes in the .gitignore entries come in.
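
The shortcut itself can be illustrated with an ordinary directory walk. This is only an analogy using Python's os.walk, not Git's code; the function name scan and the skip_dirs parameter are mine. Once a directory is pruned, nothing inside it is ever looked at, whatever later rules might say about its files.

```python
import os
import tempfile


def scan(root: str, skip_dirs: set):
    """Walk `root`, never opening directories named in `skip_dirs` (paths relative to root)."""
    opened, files = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root).replace(os.sep, "/")
        opened.append(rel)
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames
                             if (d if rel == "." else f"{rel}/{d}") not in skip_dirs)
        files += [f if rel == "." else f"{rel}/{f}" for f in sorted(filenames)]
    return opened, files


with tempfile.TemporaryDirectory() as root:
    for rel in ("home/a/.init", "home/b/.init"):
        os.makedirs(os.path.join(root, os.path.dirname(rel)), exist_ok=True)
        open(os.path.join(root, rel), "w").close()
    opened, files = scan(root, skip_dirs={"home/a"})

if __name__ == "__main__":
    print(opened)  # home/a never shows up as an opened directory
    print(files)   # so home/a/.init is never seen
```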


[1] Most repositories aren't even this big, but Microsoft are working on making Git perform with repositories of this size.

[2] The usual trick for these kinds of speed issues is to insert a cache. The problem here is that untracked files are, by definition, not in the index. Git's index does have an extension to do some untracked caching, but this can never catch everything.


The .gitignore line format

The format of lines in .gitignore is:

  • blank and comment lines are ignored;
  • lines that start with ! are negations;
  • lines that end with / refer to directories; and
  • the rest of the line names the file, complete with leading and/or embedded slashes.

A negation only makes sense to undo the effect of an earlier line. In general, later lines override earlier lines, but there is that one big exception having to do with skipping entire directories.

A line that (after any ! marking a negation) starts with a slash provides a rooted or anchored path.[3] So /home, for instance, means just that, /home, and not something like a/home. A line that contains an embedded slash is also rooted, so that home/a and /home/a both mean the same thing.

A trailing slash, if present, is disregarded in the "is rooted/anchored" test. That is, home/ and /home/ are different, because home is non-rooted/non-anchored but /home is rooted/anchored.

As Git scans through directories (folders) and subdirectories (sub-folders), it will try matching each file or directory name it finds at each level to all the non-rooted / non-anchored names. Only those at the level of that particular .gitignore get matched against the rooted / anchored names, though.

A trailing slash in the pattern means match only if this is a directory. So if home/a is a directory, it matches both home/* and home/*/; if home/xyz is a file, it matches only home/*, not home/*/.
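
The classification rules above can be written down as a small parser. This is a sketch of the rules as described here, not Git's own parser: the names Pattern and parse_line are mine, and it leaves out details such as backslash escapes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pattern:
    negated: bool   # line started with !
    dir_only: bool  # line ended with /
    anchored: bool  # a slash at the start or in the middle
    body: str       # what is left to match against names


def parse_line(line: str) -> Optional[Pattern]:
    """Classify one .gitignore line; blank and comment lines give None."""
    line = line.rstrip()
    if not line or line.startswith("#"):
        return None
    negated = line.startswith("!")
    if negated:
        line = line[1:]
    dir_only = line.endswith("/")
    if dir_only:
        line = line[:-1]  # the trailing slash does not count for anchoring
    anchored = "/" in line
    return Pattern(negated, dir_only, anchored, line.lstrip("/"))


if __name__ == "__main__":
    for text in ("home/", "/home/", "home/a", "/home/a", "!home/*/"):
        print(f"{text!r:12} -> {parse_line(text)}")
```

Here home/ comes out non-anchored while /home/ is anchored, and home/a and /home/a parse to the same thing.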

Hence, if we want to ignore all files underneath home, we use:

home/*

to ignore them. This has an embedded slash so it is rooted/anchored. Unfortunately it gives Git permission to skip all subdirectories, so we must counter that with:

!home/*/

which has a trailing slash so that it applies only to directories. It too is anchored.
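
To see why the extra line matters, here is a toy evaluator for "last matching line wins, except inside a skipped directory". It is a heavily simplified sketch under my own assumptions: it does not understand **, so the non-anchored pattern !.init stands in for !**/.init; home/notes.txt is a hypothetical file; and the function names are mine.

```python
from fnmatch import fnmatchcase


def parse(line: str):
    """Split a pattern line into (negated, dir_only, anchored, components)."""
    negated = line.startswith("!")
    if negated:
        line = line[1:]
    dir_only = line.endswith("/")
    if dir_only:
        line = line[:-1]
    anchored = "/" in line
    return negated, dir_only, anchored, line.lstrip("/").split("/")


def last_match(path: str, is_dir: bool, lines: list):
    """True = ignored, False = re-included, None = no line matched."""
    parts = path.split("/")
    verdict = None
    for line in lines:
        negated, dir_only, anchored, pat = parse(line)
        if dir_only and not is_dir:
            continue
        if anchored:  # match the whole path, component by component
            hit = len(pat) == len(parts) and all(
                fnmatchcase(c, p) for c, p in zip(parts, pat))
        else:         # match the final name at any level
            hit = fnmatchcase(parts[-1], pat[0])
        if hit:
            verdict = not negated  # later lines override earlier ones
    return verdict


def is_ignored(path: str, lines: list) -> bool:
    parts = path.split("/")
    # An ignored parent directory is skipped, so nothing inside it gets a second chance.
    for depth in range(1, len(parts)):
        if last_match("/".join(parts[:depth]), True, lines):
            return True
    return bool(last_match(path, False, lines))


asker = ["home/*", "!home/*.py", "!.init"]
fixed = ["home/*", "!home/*/", "!home/*.py", "!.init"]

if __name__ == "__main__":
    for path in ("home/script.py", "home/a/.init", "home/notes.txt"):
        print(path, is_ignored(path, asker), is_ignored(path, fixed))
```

With the original lines, home/a/.init is ignored because home/a itself is; adding !home/*/ lets the scan get far enough to un-ignore it.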


[3] I'm borrowing the term anchored from regular expression descriptions here. Rooted refers to the top level of the Git repository work-tree. Both terms should convey the right idea; use whichever you like better.

Upvotes: 1

Fady Adal

Reputation: 345

It should be

# exclude everything in the home directory and subdirectories
home/*
# include python files immediately in the home directory
!home/*.py
# include .init files in all directories and subdirectories
!**/*.init

Upvotes: 0
