tjwrona

Reputation: 9035

Is it possible to use ".gitignore" from the remote during a pull?

I have a situation where some files that were previously checked into Git now need to be ignored. To ignore them I added the files to ".gitignore" and did the following:

git rm -r --cached .
git add --all
git commit -m "Removed files from git tracking that should be ignored"
git push

Now I have a situation where I need to pull these ".gitignore" changes to another server, but when I do a git pull the files that were just added to ".gitignore" are not ignored, and instead they get removed entirely!

I think what is happening is during the pull it is using the local ".gitignore" file that does not ignore these files... and it detects that these files are no longer in git so it just removes them. If I add the files back manually and do another git pull then it begins working properly (now that the correct ".gitignore" file is on the server.)

Is there any way to tell git pull to use the ".gitignore" file from the remote server instead of the local file, so that these files get properly ignored and do not get removed on a git pull?

Upvotes: 1

Views: 1020

Answers (1)

torek

Reputation: 488519

Listing a file in .gitignore does not mean ignore the file, nor does it mean don't remove the file, or any of the things you'd like it to mean. Nothing you do here will change that. We'll come back to .gitignore near the end of this answer, but let's look first at the horrible, terrible, no-good situation you're in, which you literally cannot fix. You'll have to step around it somehow.

Here is what is wrong

The facts are that some existing commits have these files, and some other existing commits—those that occurred before those files existed, and those that you made after you took the files out of Git's index—do not have these files. Nothing can change these facts.

To understand why this is the case, let's note that Git isn't about files. Git is about commits. Commits do store files, but it's a package deal: all or nothing. You have a commit, and you have all of its files. Or, you don't have the commit, and you don't have its files (so you'll use git fetch to add the commit to your collection, and then you have it and all of its files).

Moreover, the files that are inside commits are in a useless format (we'll come back to this in the next section). They're compressed and de-duplicated, because most commits mostly have the same files that are in some other commit. So Git doesn't store them as files, but rather as internal objects, which automatically de-duplicates them.

These objects are all numbered, with what Git calls a hash ID. Commits in particular always get a unique number. (Files, which might be duplicates of other commits' files, may have non-unique numbers, which is what de-duplicates them.) This number is actually a cryptographic hash of the internal object contents. This constrains Git: Not even Git can change a commit.

If you take a commit out, make some change, and put back the different thing, it's different and therefore gets a new and different hash ID. The existing object remains in the Git repository, under its existing ID. The new and improved (we hope) object is now added to the repository, under its new ID. Anyone who uses the old ID gets the old object. Anyone who uses the new ID gets the new object. This part is really pretty straightforward.
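A quick way to see the "same content, same ID; different content, different ID" rule is to ask Git for the hash of some data directly. This is only a sketch: the sample strings are made up, and git hash-object here just computes an ID without storing anything.

```shell
set -e

# Identical content always hashes to the same object ID...
id_a=$(printf 'hello\n' | git hash-object --stdin)
id_b=$(printf 'hello\n' | git hash-object --stdin)

# ...while any change at all, however small, produces a different ID.
id_c=$(printf 'hello!\n' | git hash-object --stdin)

echo "same content:      $id_a $id_b"
echo "different content: $id_c"
```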

Now, the data inside a commit isn't just a snapshot of every file. That's in there, yes, but there is also some metadata, or information about the commit itself. This includes the name and email address of the person who made the commit, for instance. It also includes a time-stamp—this helps make sure every commit is totally unique, so that if two different people make the same commit, but for some reason, both claim to be the same person, they'll still get different commits unless they both make them at the same time (in which case, were they really two different people? Git says no).

So, there's all this metadata in each commit: author, committer, some time-stamps, log messages, and the like. But in amongst this metadata, Git has added its own information. Git stores, with each commit, the hash ID(s) of some set of earlier commits. Most commits store exactly one hash ID, which Git calls the parent commit.
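If you'd like to look at this metadata yourself, the following sketch builds a throwaway repository in a temporary directory and prints one raw commit. The file name, commit messages, and user identity are placeholders, not anything from your repository.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # call the first branch main regardless of defaults
git config user.name "Example User"        # placeholder identity for the throwaway repo
git config user.email "user@example.com"

echo one > file.txt; git add file.txt; git commit -q -m "first"
echo two > file.txt; git add file.txt; git commit -q -m "second"

# Raw commit object: tree (the snapshot), parent, author, committer, message.
git cat-file -p HEAD

# The "parent" line is simply the hash ID of the previous commit.
parent_from_object=$(git cat-file -p HEAD | sed -n 's/^parent //p')
parent_from_name=$(git rev-parse HEAD~1)
```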

These parent commit hash IDs form commits into backward-looking chains. We start at the right with the most recent (last) commit. Rather than writing out its real hash ID, we'll just call this commit H (for Hash):

            <-H

Commit H has in it both a snapshot—files—and metadata, and in that metadata, commit H stores the hash ID of an earlier commit. Let's pretend its hash ID is G, and draw it in:

        <-G <-H

Of course, commit G points to a still-earlier commit, which keeps on pointing backwards:

... <-F <-G <-H

Because each commit points backwards to its parent, Git can and will find the entire chain of commits if Git can just find the last commit in the chain. This is where branch names come in: each name in Git—branch name, tag name, remote-tracking name, and so on—stores one hash ID. For branch names, that hash ID is the ID of the last commit in its chain. This is true even if there are still-later commits in the chain, which happens normally when developing new stuff but not putting it on the main branch yet:

...--G--H   <-- main
         \
          I--J   <-- feature

Here, the feature branch has two more commits than the main branch. Commit J points back to I, which points back to H, which points back to G, and so on. So commits H and earlier are on both branches. Commits I and J are only on feature for now, but if we like, we can "slide the name main forward":

...--G--H--I--J   <-- main, feature

and now all commits are on both branches.

Branch names move about, and each name by definition picks out the last commit that's to be considered "on that branch". The commits themselves determine what's earlier on those branches. So it's the commits that matter: the names just let us find particular ones. And, remember, all commits are frozen for all time. No part of any existing commit can ever change.
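One way to "slide the name main forward" is a fast-forward merge. The sketch below, in a throwaway repository with made-up file contents, shows that no commit changes: only the hash ID stored in the name main does.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo h > file.txt; git add file.txt; git commit -q -m "H"
git checkout -q -b feature
echo i >> file.txt; git commit -q -am "I"
echo j >> file.txt; git commit -q -am "J"

git checkout -q main
before=$(git rev-parse main)               # main names commit H
git merge -q --ff-only feature             # move the name; create no new commit
after=$(git rev-parse main)                # main now names commit J
feature_tip=$(git rev-parse feature)
commit_count=$(git rev-list --count --all) # still H, I and J only
```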

Checking out a commit

As we noted above, the files inside a commit are in a format in which only Git can use them. Even then, Git can only read them. We need other programs to be able to read and write to our files. The solution is simple—and is the same as that used by other version control systems: Git copies the files out of the commit at some point. The copies, once out of the version control system, are now useful. In fact, they are just plain ordinary files: anything on the computer can use them. Git no longer has any control over them.

The normal, everyday way to get Git to copy the files out of some commit is to use git checkout. For instance, if we have:

...--G--H   <-- main
         \
          I--J   <-- feature

and we run git checkout main, Git will copy all the committed files out of commit H. This also has the side effect of selecting the name main as our current branch. Since the name main points to commit H, this means that H is our current commit. We can draw this by attaching the special name HEAD to the name main:

...--G--H   <-- main (HEAD)
         \
          I--J   <-- feature

Note that we now have two copies of every file: there's the committed one in H, which we can't touch, and there's an ordinary-form everyday file in what Git calls our working tree or work-tree.

In other version control systems, these two copies of the file are the only copies you'd find.[1] If you want to know what's going on, you compare the work-tree version of some file to the active committed version: whatever is different is what we've changed. But for whatever reason (whether or not you think this is a good idea[2]), Git stores a third copy of each file[3] in something that Git calls, variously, the index, or the staging area, or (rarely these days) the cache.

This third copy of each file sits between the read-only committed copy and the work-tree copy. Unlike the committed copy, it can be overwritten. It's pre-compressed and pre-de-duplicated, so that it's ready to go into the next commit. In fact, this is probably the best general way to think about Git's index / staging-area: it holds your proposed next commit.[4]

So, when you git checkout some commit, like commit H, Git:

  • fills in its index from the commit, so that your proposed next commit matches;
  • uses these same files to populate your working tree, so that you can see and work with your files; and
  • attaches HEAD to the branch name, assuming you used a branch name like main to select the commit. (If not, you get into "detached HEAD" mode, which we won't address here.)

If you now make changes to your working tree copies of files, you generally must also run git add: this tells Git to make the index copy match the working tree copy. For files that you updated in place, this overwrites the old index copy with a new one. For files that you removed, this removes the index copy. For new files, this creates a new file in Git's index.

Either way, adding files stages the changes, because any time you run git commit, Git will make its new snapshot from whatever is in the index right then. If you have not altered the index, the new snapshot would exactly match the current snapshot. In this case, Git generally requires that you use the --allow-empty flag: the new commit is not actually empty, it's just that it matches the old one snapshot-wise (so Git wonders: why bother? and makes you use the flag).
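To see that git commit really does use the index copy and not the working tree copy, here is a small sketch in a throwaway repository. The file name and the v1/v2/v3 contents are invented for the demonstration.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo v1 > file.txt; git add file.txt; git commit -q -m "v1"

echo v2 > file.txt
git add file.txt                 # index copy is now v2
echo v3 > file.txt               # working tree copy is v3; index copy is still v2
git commit -q -m "commit whatever is in the index"

committed=$(git show HEAD:file.txt)   # copy in the new commit
staged=$(git show :file.txt)          # copy in the index
worktree=$(cat file.txt)              # ordinary file on disk
echo "commit=$committed index=$staged worktree=$worktree"
```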

Whether or not you make any changes to your work-tree and/or run git add to update Git's index from your work-tree, the current commit remains unchanged. Once you do make a new commit, Git:

  • gathers the metadata;
  • writes out the snapshot and metadata, getting a hash ID as a result; and
  • writes the hash ID into the current branch name.

We end up with, e.g.:

          K   <-- main (HEAD)
         /
...--G--H
         \
          I--J   <-- feature

and now there is a commit on main that is not on feature.


[1] The other read-only copies in the non-current commits would also be findable, as they are in Git, but they're not active the way the current commit's are.

[2] Other systems don't have an index, proving that it's possible to work without one.

[3] This "copy" is pre-de-duplicated, so most of the time, it takes almost no space. Calling it a copy is thus slightly misleading. However, unlike many of the other bits of Git that show through to the user, the fact that this "copy" is automatically de-duplicated is really well-hidden. You can just think of this as a third copy of each file, and it all just works. Well, until you start fiddling with internal commands like git ls-files --stage and git update-index: then you need to learn about git hash-object.

[4] The index gets expanded during a conflicted merge, which means that this description is incomplete, but it's at least not wrong. :-) The index also has a role in making Git go fast, which is why it has the old name cache. You mostly see this name in option flags these days, like git rm --cached.


Switching between commits that have different files

Let's say that between commit H and commit I we remove a file. Let's say further that we put it on a new branch X:

git checkout main
git checkout -b X
git rm somefile
git commit -m 'remove a file'

Commit H has a file named somefile and commit I lacks a file named somefile.

When we git checkout main, file somefile has to come back. Git copies it from commit H to Git's index and our work-tree, and now we have the file.

When we git checkout X to move back to commit I, file somefile has to go away. Git removes it from Git's index and from our work-tree.

This property is determined by the set of files in the two commits. I would say entirely, but if you experiment a bit, you'll see that Git's removal of file somefile is conditional:

git checkout main          # file somefile comes back
git rm --cached somefile   # take somefile out of Git's index

Because we use git rm --cached here, Git removes somefile from its index, but does not touch our work-tree copy. If we now run:

git checkout X

—remember, commit I, selected by branch name X, lacks the file somefile—Git doesn't remove somefile from our working tree. The reason is that after git rm --cached, file somefile is untracked.
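The basic behavior in this section, a tracked file disappearing and reappearing as we move between commits, is easy to check in a throwaway repository. This is a sketch: somefile and branch X come from the example above, while the second file (other) and the helper function present are made up here.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo data > somefile; echo keep > other
git add somefile other
git commit -q -m "H: somefile exists"

git checkout -q -b X
git rm -q somefile
git commit -q -m "I: remove a file"

# Hypothetical helper: report whether somefile is in the working tree.
present() { if [ -e somefile ]; then echo yes; else echo no; fi; }

on_X=$(present)            # commit I lacks the file
git checkout -q main
on_main=$(present)         # commit H has it, so Git copies it back out
git checkout -q X
back_on_X=$(present)       # and removes it again
echo "X: $on_X, main: $on_main, X again: $back_on_X"
```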

Untracked files

An untracked file, in Git, is simply a file that is in your working tree right now, but not in Git's index right now. That's it—that is the entire definition—but it has a lot of consequences, including the fact that git commit would not include the untracked file in the new commit, and including the lack of removal that we just saw.
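Since the definition is just "in the working tree, not in the index", we can test for it directly. The sketch below uses a throwaway repository with made-up file names; git ls-files reads the index, and git ls-files --others lists working tree files that are not in it.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo data > somefile; echo keep > other
git add somefile other
git commit -q -m "track both files"

git rm -q --cached somefile      # remove the index copy only

in_index=$(git ls-files somefile)                        # empty: not in the index
untracked=$(git ls-files --others --exclude-standard)    # lists somefile
if [ -e somefile ]; then on_disk=yes; else on_disk=no; fi

# The next commit is built from the index, so it does not contain somefile.
git commit -q -m "next commit"
in_commit=$(git ls-tree --name-only HEAD)
```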

Because your working tree is yours, you can create and destroy files in it whenever you like.

Because Git's index is Git's, Git can put files there or take them out, but we know when it will do which thing:

  • When you git add a file, Git adds or removes the file based on what that file looks like in your working tree.

  • When you git checkout a commit, Git adds or removes files to/from the index based on whether those files are in the other commit.

  • When you run git rm --cached, Git removes files from Git's index as instructed.

  • Other cases not covered here include how git merge manipulates Git's index, how git reset and git restore work, and so on.

So, to some extent, you control which files are in Git's index—but they tend to mirror commits.

The index and working tree are specific to each clone

Git is a little bit ambivalent as to whether the index and working tree are included in a repository. Specifically, git init --bare makes a repository that has no working tree, but such a repository still has an index. (It probably shouldn't, but it does.) There is also the git worktree add command, since Git 2.5, which adds a pair—a working tree and an index—to a repository. So there can be multiple index-and-work-tree sets in any given repository.

It's clear enough, though, that git clone does not copy the index and work-tree of any existing repository (regardless of how many exist in that repository). So the index, or all indices, and work-tree(s) are private to each clone. You can't control any other repository's index and work-tree directly: you have to leave that to whoever might be operating Git on the other machine (assuming the other clone is on another machine).
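Here is a small sketch of that last point, using two throwaway repositories named origin and clone inside a temporary directory (all names are invented). Whatever is staged or lying around untracked in the first repository simply does not exist in the clone.

```shell
set -e
base=$(mktemp -d)
trap 'rm -rf "$base"' EXIT
git init -q "$base/origin"
cd "$base/origin"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo v1 > file.txt; git add file.txt; git commit -q -m "v1"
echo staged-only > file.txt; git add file.txt   # in this index, but in no commit
echo scratch > untracked.txt                    # in this working tree only

git clone -q "$base/origin" "$base/clone"

origin_index=$(git show :file.txt)              # index copy in the original
cd "$base/clone"
clone_index=$(git show :file.txt)               # index copy in the clone
```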

About .gitignore

The .gitignore file is misnamed. A better name would be .git-do-not-complain-about-these-files-if-they-are-untracked-and-if-they-are-untracked-and-I-use-an-en-masse-add-command-do-not-add-them-to-the-index-either.

When we run git status, Git complains about untracked files. It gets very whiny! This is quite annoying, because with a work-tree being an ordinary directory, and the kinds of software that we use, we run programs that create lots of build artifacts in our work-trees. This leaves tons of untracked files. The git status command becomes noisy, and our productivity plummets.

To get git status to shut the ____ up, we can list these expected build products in a .gitignore file. This has no effect on whether those files are in the index right now. But if they're not in the index—if they are untracked right now—then git status won't complain about them.

Of course, if git status doesn't complain, it would be really nice if git add . also "worked right", by not adding them. So that's the second main effect of listing a file in .gitignore: if the file isn't already tracked—if it's not in the index now—and we run git add ., we want Git to not add it.

If the file is already in the index (is tracked), listing it in .gitignore has no effect on git status and git add: the file's status will be checked, and the git add will add the file.[5] So for already-tracked files, .gitignore is no help. That's why the file's name isn't really right. But a more correct name would be unusable, so .gitignore it is.
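The difference between tracked and untracked files that match a .gitignore pattern shows up clearly in a throwaway repository. In this sketch the pattern *.log and both file names are made up.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo a > tracked.log
git add tracked.log; git commit -q -m "track a log file"
echo '*.log' > .gitignore
git add .gitignore; git commit -q -m "ignore logs"

echo b > tracked.log       # tracked file that matches the pattern
echo c > new.log           # untracked file that matches the pattern

status=$(git status --porcelain)          # reports tracked.log, says nothing about new.log
git add .
staged=$(git diff --cached --name-only)   # only tracked.log was added
```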

Listing a file in .gitignore has one more side effect: it gives Git permission to clobber the file. This mostly involves checking out an old commit that does contain the file, when the file is being untracked-and-ignored. The checkout proceeds, and now you have the tracked file out, with the untracked one's data having been literally destroyed. So the real full name might be .git-about-some-files-that-may-be-untracked-and-what-to-do-if-they-are:do-not-complain-and-do-not-auto-add-but-do-feel-free-to-destroy-these-files, or something. (But colon characters are disallowed on many Microsoft systems.)
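The clobbering side effect can be reproduced with a sketch like this one. It assumes a throwaway repository; the file name notes.txt, the branch name old, and the file contents are invented. The local, ignored copy is overwritten by the checkout without any complaint.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "committed version" > notes.txt
git add notes.txt; git commit -q -m "old commit tracks notes.txt"
git branch old                             # remember the commit that has the file

git rm -q notes.txt
echo notes.txt > .gitignore
git add .gitignore; git commit -q -m "stop tracking notes.txt and ignore it"

echo "precious local edits" > notes.txt    # untracked and ignored
before=$(cat notes.txt)

git checkout -q old                        # old commit contains notes.txt
after=$(cat notes.txt)                     # the local edits are gone
echo "before: $before / after: $after"
```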


[5] There are some index flags, part of the cache aspect of the index, that one can abuse to prevent git status from looking, and git add from adding. These are the assume-unchanged and skip-worktree flags. They are not designed for this purpose, hence the abuse notion above, and they don't help with the particular problem to which this is an answer, but they're worth mentioning.


What you must do

You have several options. The most drastic, but easiest to enforce and perhaps easiest to do, is: make a whole new Git repository. Be careful never to add these files, so that they never become tracked and therefore never become a problem. Move all your systems to the new Git repository, abandoning (and eventually destroying) the old Git repository.

Alternatively, you can do a minimal update: make new commits that, compared to the old commits, remove the files. Then, go around to every deployment and update those systems by hand, carefully preserving the files while making them untracked. You can use the git rm --cached trick, or save the files outside the working tree during the checkout, or whatever else you like. Any of these methods works. Then be very careful never to return to the poisonous commits that will make those files tracked.
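As a rough sketch of the "save the files outside the working tree" variant, the script below simulates an upstream repository and one deployment in a temporary directory. Every name in it (upstream, deploy, backup, local.cfg) is hypothetical, and a real deployment would need the same three steps applied to each affected file.

```shell
set -e
base=$(mktemp -d)
trap 'rm -rf "$base"' EXIT
git init -q "$base/upstream"
cd "$base/upstream"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "site data" > local.cfg
git add local.cfg; git commit -q -m "track local.cfg (the original mistake)"
git clone -q "$base/upstream" "$base/deploy"

# Upstream: stop tracking the file and list it in .gitignore.
git rm -q --cached local.cfg
echo local.cfg > .gitignore
git add .gitignore; git commit -q -m "untrack and ignore local.cfg"

# Deployment: save the file, take the update (which removes it), put it back.
cd "$base/deploy"
mkdir "$base/backup"
cp -p local.cfg "$base/backup/"
git pull -q --ff-only
cp -p "$base/backup/local.cfg" local.cfg

restored=$(cat local.cfg)
still_tracked=$(git ls-files local.cfg)    # empty: the file is now untracked
```

After this, the deployment has the new .gitignore, so the restored file is both untracked and ignored there.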

In between these two options, you can use a history rewriting tool (filter-branch, filter-repo, The BFG, whatever you like) to take your existing commits and turn them into new-and-improved commits in which those files have never been committed. This is a lot like the first and/or second option: you still have to go, carefully, to each deployment and update it, because the rewritten repository is, in effect, a new repository. It has the downside that someone with the old (pre-rewrite) repository can easily accidentally re-introduce the bad commits, if the histories sync up. (Whether they do depends on what's in the first few commits ever made and/or how you do the history rewrite.)
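For illustration only, here is what such a rewrite might look like with git filter-branch in a throwaway repository. The file names are made up, and this is one possible invocation rather than a recommendation; the other tools mentioned above have their own syntax. Note how the branch tip gets a new hash ID: these are new commits, not changed ones.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo code > app.txt; echo secret > site.cfg
git add app.txt site.cfg; git commit -q -m "add app and site.cfg"
echo more >> app.txt; git commit -q -am "update app"

old_tip=$(git rev-parse main)

# Copy every commit, dropping site.cfg from each copied snapshot.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --index-filter 'git rm -q --cached --ignore-unmatch site.cfg' \
    -- --all >/dev/null

new_tip=$(git rev-parse main)
touching=$(git rev-list main -- site.cfg)  # commits on main that mention site.cfg
```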

If you have full control of the software, the best option is usually this:

  • Move important data that can/should be version-controlled, if any, to a new file name. This might, for instance, be config.defaults.
  • Move important data that must not be version controlled—because it differs on each site—to a new file name. This might, for instance, be config.site.
  • Make sure config.site does not appear in any past, present, or future commit. Do list it in .gitignore so that it won't accidentally get added and committed.
  • Update all installations to have a correct (and by definition, untracked) config.site file.
  • Distribute the new revision. All sites now use the defaults and per-site configuration. None of the old commits need to change.
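The Git side of that list can be sketched as follows, in a throwaway repository. The names config.defaults and config.site come from the list above, while the file contents are invented; how your software reads the two files is up to you and is not shown.

```shell
set -e
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "color=blue" > config.defaults        # shared, version-controlled defaults
echo config.site > .gitignore              # per-site file must never be added
git add config.defaults .gitignore
git commit -q -m "add defaults; ignore per-site configuration"

echo "color=red" > config.site             # created by hand on each installation
git add .                                  # en-masse add skips the ignored file

if git check-ignore -q config.site; then ignored=yes; else ignored=no; fi
in_index=$(git ls-files config.site)               # empty: untracked
in_history=$(git rev-list --all -- config.site)    # empty: in no commit
```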

You can't change the past, but there's no need to do that.

Upvotes: 6
