How to ignore tweaked/untracked files I never want to share?

In my current project, there are a lot of files that need to be tweaked (but can't be committed) in order to have a viable dev environment, and in some cases there's a strange behavior, such as the server itself writing to the server.config file (apparently this is intended behavior). For example, replacing a configuration file's contents to disable async requests, or adding some flags to alter a certain module's behavior.

There are also some files created from templates. The templates are under version control, but the generated files are not, and it seems .gitignore won't care about rules if the file is untracked.

Eventually, I start to find the following issues:

  1. A lot of files I need to back up and paste on new branches (or if I'm only working on 1 feature branch at a time, I can checkout and carry my changes over, but it's not always the case).
  2. A lot of files I can't commit, but can't ignore either, cluttering my working copy's uncommitted changes (and I know I'll eventually make a mistake and commit something I shouldn't).
  3. When I pull changes from master I get a merge error, so I need to stash-pull-pop-merge, and sometimes that throws me an error and tells me that it's not possible to pop the stash due to pending changes.
  4. At times, Git will lock out and won't let me do anything. This might be a knowledge gap on my end, but I can't checkout another branch due to uncommitted changes, can't stash due to conflicting files, can't pull due to pending changes, etc. which ends up getting sorted out by deleting my working copy after backing up, cloning the branch again, and restoring the backup files, which leaves me at the point I was.

Given that the project has been in production for over two decades, I highly doubt we can do a 180 on the architecture, so I hope there's something else I can do to avoid the massive clutter that keeps causing me issues.

My intention here is to be able to handle these files so they are ignored for commit, but get updated when I pull changes, and I need to know when they are updated so I can add my tweaks if necessary.

I have tried this command, but I'm unsure of the implications: git update-index --skip-worktree <path_to_file>. This removes the file from the change list, but last time there was a branch refresh it started telling me some files had pending changes, and after reverting those files I would still get the same issue, and eventually had to delete and clone again.

Thank you in advance.


Answers (1)

torek


If git update-index --skip-worktree path works at all, this means that the file is tracked. In Git, a tracked file is a file that is in Git's index. The git update-index command operates on Git's index, so the file has to be there, in Git's index, for git update-index to do this kind of thing to it.
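A quick way to check which side of the tracked/untracked line a file is on is to ask Git whether the path is in its index. The following is a minimal sketch in a throwaway repository; the file names and the is_tracked helper are made up for illustration, and only git ls-files --error-unmatch and git update-index are real Git commands.

```shell
# Throwaway repository so the experiment cannot touch real work.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "async=true" > server.config     # will be tracked
echo "scratch"    > notes.local       # stays untracked
git add server.config && git commit -q -m "Add server.config"

# Hypothetical helper: ls-files --error-unmatch exits 0 only for paths in the index.
is_tracked() { git ls-files --error-unmatch -- "$1" >/dev/null 2>&1; }

is_tracked server.config && tracked_status=tracked   || tracked_status=untracked
is_tracked notes.local   && untracked_status=tracked || untracked_status=untracked
echo "server.config: $tracked_status, notes.local: $untracked_status"

# update-index can only flag index entries, so it refuses the untracked file.
if git update-index --skip-worktree notes.local 2>/dev/null; then
  flag_result=accepted
else
  flag_result=rejected
fi
echo "skip-worktree on notes.local: $flag_result"
```

If the flag is accepted for one of your files, that file is tracked, whatever you intended.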

Git has three names for this thing—this "index":

  • It's called the index, which has the advantage that this is a pretty meaningless name, so you won't bring a lot of preconceived but wrong notions with you when you talk or think about the index. But it has the disadvantage that it's a pretty meaningless name. You might not even remember it, except for the need to use git update-index to fiddle with it here.

  • It's called the staging area. This is a better term in a lot of ways, because it reflects how you normally use it. Unfortunately, this term doesn't capture every aspect of the index, which is why I usually call it "the index".

  • It's called the cache, which is probably the worst of the three names. These days about the only place you see this is in git rm --cached, which means remove the file from the index (using the old name "cache"). Most Git commands now have --staged as a synonym, or for git restore, as the only way to refer to the index / staging-area copy of a file; git rm is a key holdout that still requires --cached here.

The fact that it has three names is a clue to just how important it is that you understand how Git uses its index / staging-area. I contend that you cannot use Git correctly unless you know this, and your question itself is one of my arguments for making this claim. Whoever or whatever taught you the use of Git was remiss in not going into detail about the index here.

How to remember what the index / staging-area is about

The index actually has multiple roles, but there's one key one, and it's easy to remember, especially if you use the term staging area. When you run git commit, Git does not make the new commit from the copies of files you can see. Instead, Git makes the new commit from the copies of files that are in Git's index / staging-area.

That's it: that's the part you have to memorize. Git makes a new commit from the files that are in the staging area. You can't see this staging area directly though! Well, that's a bit of a lie: we can look directly, but it's kind of like staring at the sun. It's overwhelming.
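To convince yourself of this, you can stage one version of a file, keep editing, and then commit. This is only a sketch in a scratch repository with a made-up file name, but it shows that the commit holds the staged content rather than what is on disk.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "version 1" > file.txt
git add file.txt && git commit -q -m "First commit"

echo "version 2" > file.txt
git add file.txt                  # the index copy is now "version 2"
echo "version 3" > file.txt       # the working-tree copy moves on; the index does not

git commit -q -m "Second commit"  # built from the index, not from the working tree

committed=$(git show HEAD:file.txt)
working=$(cat file.txt)
echo "committed:    $committed"
echo "working tree: $working"
```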

Using git status to see what's in the index

We generally look to see what's in the staging area by using git status, which shows us two views of what's in the staging area. Both views conceal most of what's there, and the reason becomes clear enough if we use git ls-files --stage to look at the staging area directly:

$ git ls-files --stage | head -5
100644 4860bebd32f8d3f34c2382f097ac50c0b972d3a0 0       .cirrus.yml
100644 c592dda681fecfaa6bf64fb3f539eafaf4123ed8 0       .clang-format
100644 f9d819623d832113014dd5d5366e8ee44ac9666a 0       .editorconfig
100644 b0044cf272fec9b987e99c600d6a95bc357261c3 0       .gitattributes
100644 c8755e38de81caf60768c0309b5348f03a120fc1 0       .github/CONTRIBUTING.md
$ git ls-files --stage | tail -5
100644 947d9fc1bb8cf95719284de6563227485907988f 0       xdiff/xprepare.h
100644 8442bd436efeab81afc25db9d89da082638fcca4 0       xdiff/xtypes.h
100644 9e36f24875d20711b61d243994f324d00a1b211e 0       xdiff/xutils.c
100644 fd0bba94e8b4d2442ba59d0a4327d2d53e10210a 0       xdiff/xutils.h
100644 d594cba3fc9d82d94b9277e886f2bee265e552f6 0       zlib.c
$ git ls-files --stage | wc -l
    4182

This is in a Git repository that holds a clone of Git itself. Git isn't a small project, but it's hardly the biggest one ever either, and it has a bit over 4000 files in it at this point. If we tried to look here directly every time, we'd see all 4182 files every time. That's not very manageable. But suppose we run git status. I've deliberately made a few pointless changes and git add-ed one file here:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   Makefile

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   zlib.c

We see two files, out of the 4182, that are interesting. They're interesting because they don't match.

That is, git status reveals things in the index—and in the working tree—that "don't match". We see two files here: the remaining 4180 files all match up. But I said I made two changes and git add-ed one. The changes now seem to be split up. What's going on here?

Three "active copies" of every file

Whenever you are working in Git, there are three—or rather, up to three—of what I like to call "active copies" of every file. There are 4182 files in the index right now—4182 copies of files, that is—and there are 4182 files in the current commit as well, and if I run git commit, my new commit will contain 4182 files. There are actually many more than 4182 files in the working tree (I did a quick count with ls -R1 and got 5964) but many of those are "ignored" files, which we'll cover more precisely in a moment.

Something else you need to know about Git, that should have been covered in your Git introduction (but might not have been), is that the files in any given commit are kept in a special, read-only, Git-only, compressed and de-duplicated format. Only Git can read these files, and literally nothing—not even Git itself—is allowed to change any of these files. As such, the files inside a commit are only useful as an archive. To get any actual work done, the files have to be extracted.

The main useful copy of each file is the one in your working tree. These files are ordinary files: you can read them, you can write them, and in fact your OS can do anything that your OS can do with or to them, because they are ordinary files. These files are not in Git at all! They came out of Git, when Git extracted the commit. But now they're just ordinary files in an ordinary folder on your ordinary computer. You use them however you want.

When you go to make a new commit, however, Git doesn't use these files, the not-in-Git ones. Instead, Git keeps an in-between copy of every file. We might call it a "copy", in quotes, because this intermediate copy is stored in the de-duplicated format. What's special about the index copy is that Git can replace it, wholesale, with a new copy. This is what git add is all about. We run git add file and Git will read the working tree copy, compress it down to the internal format, do any de-duplicating needed, and prep the file to be committed.

This means that before git add, the index / staging-area holds every file, ready to be committed. After git add, the index / staging-area holds every file, ready to be committed. All git add does is swap in a different copy of some file—or, for a file name that isn't in the current commit and hence isn't in the index yet, add that as a new file.
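The swap can be observed through the blob hash that git ls-files --stage prints for a path. In this sketch (scratch repository, made-up file name, and an index_blob helper that is not part of Git), editing the working-tree file leaves the index entry alone, and git add replaces it with the hash of the new content.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "one" > file.txt
git add file.txt && git commit -q -m "Add file.txt"

# Hypothetical helper: the second field of ls-files --stage is the blob hash.
index_blob() { git ls-files --stage -- "$1" | awk '{print $2}'; }

before=$(index_blob file.txt)
echo "two, and a bit longer" > file.txt
after_edit=$(index_blob file.txt)      # editing alone does not touch the index

git add file.txt
after_add=$(index_blob file.txt)       # git add swapped in a new index copy
on_disk=$(git hash-object file.txt)    # hash Git computes for the current content

echo "before:     $before"
echo "after edit: $after_edit"
echo "after add:  $after_add"
```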

Running git commit freezes these prepared index copies into a new commit. The new commit then becomes the current commit, and we're back to the situation where the current commit copies of files match the index copies (and the index copies are now by definition all duplicates and hence take no space).

Removing files: be careful!

Using git rm --cached removes the index copy of a file, without touching your working-tree copy. The next commit you make will now lack that file. This is all well and good, but remember: if you check out some old commit that has the file, Git now has to extract the committed copy. Git will put one copy (or "copy") into its index, and one copy into your working tree. This needs to overwrite the copy you carefully didn't remove, when you used git rm --cached. (To keep this answer shorter, I won't cover all the gory details about when Git thinks it's safe to clobber such a file, and when it doesn't.)

Using plain git rm removes both the index and working-tree copies of the file. If you are able to use plain git rm, that's an indication that you don't need to be so careful about this. If you're using git rm --cached, it means watch out, other operations might clobber or remove the file later, so you'd better keep a spare copy somewhere safe.
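The difference between the two forms is easy to see side by side. A small sketch, again in a scratch repository with placeholder file names:

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "keep me on disk" > keep.txt
echo "delete me"       > gone.txt
git add keep.txt gone.txt && git commit -q -m "Add two files"

git rm -q --cached keep.txt   # index copy removed, working-tree copy left alone
git rm -q gone.txt            # index copy and working-tree copy both removed

[ -e keep.txt ] && keep_on_disk=yes || keep_on_disk=no
[ -e gone.txt ] && gone_on_disk=yes || gone_on_disk=no
index_count=$(git ls-files | wc -l | tr -d ' ')

echo "keep.txt on disk: $keep_on_disk"
echo "gone.txt on disk: $gone_on_disk"
echo "files left in the index: $index_count"
```

After this, keep.txt is an untracked file, and the next commit will record both files as deleted.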

Three copies = two diffs

Suppose you have a tiny little fence (perhaps to keep a tiny sheep or goat from eating your flowers, or to keep the unicorn from eating your burger):

           🦄
  ┃====🌈=┃======┃
  ┃  🍔   ┃      ┃

Note that there are three posts, but only two wooden cross-bar sections between them. In the same way that we need two sets of cross-bars to separate three fenceposts, we need two diffs to compare three snapshots.

If we run one git diff that compares the current commit's files to the files in Git's index, we get:

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   Makefile

That is, the Makefile copy in the current commit doesn't match the Makefile copy in Git's index. All the other files do match, and Git can tell this really fast because of the special de-duplicated format.

A second diff, comparing the files in Git's index to the files in your working tree, says:

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   zlib.c

That is, the Makefile in Git's index matches the Makefile in my working tree, so it didn't get mentioned. 4180 other files also all match and didn't get mentioned. But zlib.c, in my working tree, is different from zlib.c in Git's index. So I've "modified" this file, but not "staged" it. If I run git add zlib.c, Git will replace the index copy of zlib.c (currently a duplicate of the committed copy) with a new de-duplicated zlib.c using whatever I did to the working tree file. Now the index and working tree copies will match, so this won't be mentioned under "Changes not staged for commit", but the current commit and index copies won't match, and the file will "move" to the "Changes to be committed" section.

It's possible, of course, to put something different in all three copies. Suppose I start with a fresh checkout. Then I add a blank line or a comment to the ordinary file named Makefile and run git add Makefile. Then I add another blank line or comment to the ordinary working tree file. Now Makefile will be both staged for commit (I changed the index copy) and not staged for commit (I further changed the working tree copy). Or, I could add the blank line or comment to Makefile, use git add, then change the working tree copy back. Now if I run git add Makefile again, the git add step will compress and de-duplicate the content, getting the original content back and thus re-using the committed copy, so all the apparent changes vanish!

Try this out as an experiment. Run git diff, run git diff --staged, and run git diff HEAD, as you do this. Note how each git diff just compares two versions at a time, even though there are always three active versions.
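Here is one way to set that experiment up, as a sketch. The two files borrow the names Makefile and zlib.c from the status output above, but their contents are placeholders in a scratch repository, not the real Git sources.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

printf 'all:\n'    > Makefile
printf 'int x;\n'  > zlib.c
git add Makefile zlib.c && git commit -q -m "Base"

echo "# comment" >> Makefile && git add Makefile   # changes the index copy
echo "/* comment */" >> zlib.c                      # changes only the working tree

staged=$(git diff --staged --name-only)    # current commit vs. index
unstaged=$(git diff --name-only)           # index vs. working tree
head_vs_tree=$(git diff HEAD --name-only | wc -l | tr -d ' ')  # commit vs. working tree

echo "staged:   $staged"
echo "unstaged: $unstaged"
echo "files differing from HEAD: $head_vs_tree"
```

The first two diffs are the two sections of git status; git diff HEAD skips the index and so reports both files.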

Using .gitignore

If we add an entirely new file to the working tree, run git add file, and run git status, we see a new file in the git status output:

$ touch crazy
$ git add crazy
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   crazy

If we don't git add it, though, the status is not new file in the not staged for commit section. Instead, Git moves this out to its own special category:

On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    crazy

nothing added to commit but untracked files present (use "git add" to track)

In other words, Git complains about untracked files. Fortunately, as bk2204 commented, any untracked file can be listed in a .gitignore file, and doing so makes Git stop complaining about the file (though in my case I just want to remove the untracked file entirely).

Once a file is tracked, though, you can't have Git ignore it this way. The name .gitignore is misleading: Git won't ignore it unless it's also untracked.
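The following sketch shows both halves of this in a scratch repository. The file names are placeholders (server.config is borrowed from the question). It also uses .git/info/exclude, which takes the same patterns as .gitignore but lives only in your own clone, so it may suit files you never want to share; whether that fits your workflow is for you to judge.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "async=true" > server.config
git add server.config && git commit -q -m "Track server.config"

# Shared ignore rules: committed, so everyone gets them.
printf 'server.config\ngenerated.out\n' > .gitignore
git add .gitignore && git commit -q -m "Add ignore rules"

# Private ignore rules: same pattern syntax, never committed or pushed.
mkdir -p .git/info && echo "notes.local" >> .git/info/exclude

echo "local tweak"  >> server.config   # tracked: the ignore rule does nothing
echo "build output" >  generated.out   # untracked and ignored via .gitignore
echo "scratch"      >  notes.local     # untracked and ignored via info/exclude

status=$(git status --porcelain)
git check-ignore -q generated.out && generated_ignored=yes || generated_ignored=no

echo "status: [$status]"
echo "generated.out ignored: $generated_ignored"
```

Only server.config shows up as modified, even though it is listed in .gitignore.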

Let's repeat this: A tracked file is a file that's in Git's index. How does a file get into Git's index? Well, one way is that we git add it. Now it's in Git's index. Another way is that we check out a commit that has the file. That, too, stuffs the file into Git's index. Those are the main two ways that files get into Git's index (the others are esoteric enough to ignore for this answer).

How do we get a file out of Git's index? Use git rm, or git rm --cached, or check out a commit that doesn't have the file. Peculiarly, we can also remove the file from the working tree and use git add. I never do this on purpose myself, because it feels so wrong to me, but git add means "make the index copy match the working tree copy", and Git takes that as far as "remove the index copy if I've removed the working tree copy". (The git add command has --ignore-removal and --no-ignore-removal flags to control this, if you like; see the git add documentation.)
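The peculiar git add case looks like this in practice. This sketch assumes Git 2.0 or later, where git add with a path stages removals by default, and uses a made-up file name in a scratch repository.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "short-lived" > doomed.txt
git add doomed.txt && git commit -q -m "Add doomed.txt"

rm doomed.txt          # an ordinary delete of the working-tree copy
git add doomed.txt     # "make the index match": the index copy is removed too

staged=$(git status --porcelain)
remaining=$(git ls-files | wc -l | tr -d ' ')

echo "status: [$staged]"
echo "files left in the index: $remaining"
```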

So we have a lot of control over whether a file is in Git's index, but any time we check out any existing commit, Git takes over control for a moment. So the set of "which files are tracked" and "which files are untracked" is partly under our control (we get to run git add and git rm) and partly under Git's, for instance when we use git checkout or git switch to check out a commit. The git reset and git restore commands can also affect Git's index, though these are not commands we use quite as often. If you're not sure about any given command, check its documentation.

The main thing to keep in mind here is that if any existing commit has the file in it, checking out that commit puts the file into Git's index, making it tracked. Since no existing commit can be modified in any way, you're stuck with it (short of "rewriting history" so that you have a different set of existing commits, with all the pain that this implies).
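To find out whether you are already in this situation, you can ask Git which commits touch a given path. The sketch below uses a scratch repository and placeholder names; note that git log with a path lists commits that changed the path, which is enough to tell whether it was ever committed.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "async=true" > server.config
git add server.config && git commit -q -m "Add server.config"
git rm -q server.config && git commit -q -m "Remove server.config"
echo "scratch" > never-committed.txt

# Count commits on any branch that touch each path.
touched=$(git log --all --format=%h -- server.config | wc -l | tr -d ' ')
never=$(git log --all --format=%h -- never-committed.txt | wc -l | tr -d ' ')
echo "commits touching server.config: $touched"
echo "commits touching never-committed.txt: $never"

# Checking out the older commit puts the file back into the index: tracked again.
git checkout -q HEAD~1
resurrected=$(git ls-files -- server.config)
echo "tracked after checkout: $resurrected"
```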

Using --skip-worktree

The git update-index command is a special-use command that specifically updates Git's index (hence the name). It can do a few tricks that other commands can't. In particular, for any given index copy of a file, there are two "magic" flags:

  • The "assume unchanged" flag tells Git I, the user, didn't modify the working tree copy of this file.

  • The "skip worktree" flag tells Git You, Git, should ignore the working tree copy of this file.

These both wind up having the same effect in almost all cases, but they're meant for two different specific uses:

  • The assume-unchanged flag is meant for use on file systems where lstat system calls are horribly slow or ineffective. Git is allowed to completely ignore this flag, so in theory you shouldn't use it for the kind of trick you want it for.

  • The skip-worktree flag is meant for Git's sparse checkout operation. Git will set and clear this flag on its own when you're using sparse checkout, so in theory you shouldn't use it for the kind of trick you want it for.

That, of course, leaves you with no options: if you can't use either flag, you cannot tell Git: do not look at the working tree copy of this file, just keep using your index copy instead. In fact, both flags work—or mostly work—for the trick you want. Set either one to tell Git not to look at the working tree copy and to just keep using the index copy, and it mostly achieves what you want.

That mostly is a pretty big caveat, and the pain here is that when it doesn't work, you usually have to turn the flag off, deal with the file in various ways, and then turn the flag back on once you're done. It's far better to remove the file entirely and use a different file name that you keep as an untracked file and list in .gitignore so that you never accidentally commit it.
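If you do use the flag anyway, it helps to know how to see which files carry it and how to turn it off again when Git starts objecting. A minimal sketch in a scratch repository, with server.config standing in for one of your tweaked files:

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "async=true" > server.config
git add server.config && git commit -q -m "Add server.config"

git update-index --skip-worktree server.config
echo "async=false" > server.config          # local tweak

status_flagged=$(git status --porcelain)    # the tweak is hidden from status
flagged=$(git ls-files -v | grep '^S')      # "S" marks skip-worktree entries

# When a checkout, pull or stash complains about this file, clear the flag first.
git update-index --no-skip-worktree server.config
status_cleared=$(git status --porcelain)    # the tweak is visible again

echo "while flagged: [$status_flagged]"
echo "flagged files: $flagged"
echo "after clearing: [$status_cleared]"
```

The clear, deal with it, set it again cycle is exactly the pain described above, so keep a list of the files you have flagged.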

Conclusion

Because of the index and the three-copies-of-every-file thing, it's best never to have committed a file that should be untracked. That way it can be untracked now and forever, and you can list it in .gitignore to help make it stay that way. Being listed in .gitignore means Git won't complain about it, and en-masse "add all files" operations won't add it to Git's index, and it won't get committed.

Once some file has been committed some time in the past, using that commit will cause the file to be tracked. Checking out that commit will extract that file, regardless of any fiddling you try to do with index flags and git update-index. Moreover, git update-index can only work on a file that is in Git's index, and is therefore tracked, so if you find yourself using it this way, it means you're already in trouble.

(Consider making a change to your software so that all future commits use a new name for a new, untracked file that you will make sure never gets tracked.)
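One possible shape for that change, as a sketch only: keep a tracked template, and have each developer copy it to a new, ignored file name that the software reads instead. The names server.config.template and server.local.config are hypothetical, and the last step uses HEAD~1 to simulate an incoming change; after a real git pull, comparing ORIG_HEAD with HEAD should serve the same purpose.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

# Legacy state: the config file itself is tracked.
echo "async=true" > server.config
git add server.config && git commit -q -m "Legacy: tracked server.config"

# One-time migration, committed and shared with the team.
git mv server.config server.config.template
echo "server.local.config" > .gitignore
git add .gitignore && git commit -q -m "Track a template; ignore the local copy"

# Per clone: create the local copy once, then tweak it freely.
[ -e server.local.config ] || cp server.config.template server.local.config
echo "async=false" > server.local.config
status=$(git status --porcelain)            # nothing to commit, nothing untracked
echo "status: [$status]"

# Simulate an upstream change to the template arriving.
echo "timeout=30" >> server.config.template
git commit -q -am "Upstream: add timeout to the template"

# Which templates changed? Re-apply your tweaks to the matching local copies.
changed_templates=$(git diff --name-only HEAD~1 HEAD -- '*.template')
echo "templates changed: $changed_templates"
```

Old commits still contain server.config, so checking them out brings that tracked file back, but the new local file name never collides with it.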

