Reputation: 9035
I have a situation where some files that were previously checked into Git now need to be ignored. To ignore them I added the files to ".gitignore" and did the following:
git rm -r --cached .
git add --all
git commit -m "Removed files from git tracking that should be ignored"
git push
Now I have a situation where I need to pull these ".gitignore" changes to another server, but when I do a git pull the files that were just added to ".gitignore" are not ignored, and instead they get removed entirely!
I think what is happening is during the pull it is using the local ".gitignore" file that does not ignore these files... and it detects that these files are no longer in git so it just removes them. If I add the files back manually and do another git pull then it begins working properly (now that the correct ".gitignore" file is on the server.)
Is there any way to tell git pull to use the ".gitignore" file from the remote server instead of the local file, so that these files get properly ignored and do not get removed on a git pull?
Upvotes: 1
Views: 1020
Reputation: 488519
Listing a file in .gitignore does not mean ignore the file, nor does it mean don't remove the file, or any of the things you'd like it to mean. Nothing you do here will change that. We'll come back to .gitignore near the end of this answer, but let's look first at the horrible, terrible, no-good situation you're in, which you literally cannot fix. You'll have to step around it somehow.
The facts are that some existing commits have these files, and some other existing commits—those that occurred before those files existed, and those that you made after you took the files out of Git's index—do not have these files. Nothing can change these facts.
To understand why this is the case, let's note that Git isn't about files. Git is about commits. Commits do store files, but it's a package deal: all or nothing. You have a commit, and you have all of its files. Or, you don't have the commit, and you don't have its files (so you'll use git fetch to add the commit to your collection, and then you have it and all of its files).
Moreover, the files that are inside commits are in a useless format (we'll come back to this in the next section). They're compressed and de-duplicated, because most commits mostly have the same files that are in some other commit. So Git doesn't store them as files, but rather as internal objects, which automatically de-duplicates them.
These objects are all numbered, with what Git calls a hash ID. Commits in particular always get a unique number. (Files, which might be duplicates of other commits' files, may have non-unique numbers, which is what de-duplicates them.) This number is actually a cryptographic hash of the internal object contents. This constrains Git: Not even Git can change a commit.
If you take a commit out, make some change, and put back the different thing, it's different and therefore gets a new and different hash ID. The existing object remains in the Git repository, under its existing ID. The new and improved (we hope) object is now added to the repository, under its new ID. Anyone who uses the old ID gets the old object. Anyone who uses the new ID gets the new object. This part is really pretty straightforward.
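To see the "same content, same hash ID" rule in action, here is a minimal sketch using git hash-object, which computes the ID Git would assign to some content. The sample strings are arbitrary, and the example only compares IDs rather than relying on any particular value:

```shell
# Identical content always gets the same object ID;
# changing even one byte produces a different ID.
id_a=$(printf 'hello\n'  | git hash-object --stdin)
id_b=$(printf 'hello\n'  | git hash-object --stdin)
id_c=$(printf 'hello!\n' | git hash-object --stdin)

echo "$id_a"
echo "$id_b"   # same as id_a: this is what makes de-duplication automatic
echo "$id_c"   # different content, different ID
```

Commits work the same way, except that their metadata (discussed next) makes each one's content, and so its ID, unique.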
Now, the data inside a commit isn't just a snapshot of every file. That's in there, yes, but there is also some metadata, or information about the commit itself. This includes the name and email address of the person who made the commit, for instance. It also includes a time-stamp—this helps make sure every commit is totally unique, so that if two different people make the same commit, but for some reason, both claim to be the same person, they'll still get different commits unless they both make them at the same time (in which case, were they really two different people? Git says no).
So, there's all this metadata in each commit: author, committer, some time-stamps, log messages, and the like. But in amongst this metadata, Git has added its own information. Git stores, with each commit, the hash ID(s) of some set of earlier commits. Most commits store exactly one hash ID, which Git calls the parent commit.
These parent commit hash IDs form commits into backward-looking chains. We start at the right with the most recent (last) commit. Rather than writing out its real hash ID, we'll just call this commit H (for Hash):
<-H
Commit H has in it both a snapshot (files) and metadata, and in that metadata, commit H stores the hash ID of an earlier commit. Let's pretend its hash ID is G, and draw it in:
<-G <-H
Of course, commit G points to a still-earlier commit, which keeps on pointing backwards:
... <-F <-G <-H
Because each commit points backwards to its parent, Git can and will find the entire chain of commits if Git can just find the last commit in the chain. This is where branch names come in: each name in Git—branch name, tag name, remote-tracking name, and so on—stores one hash ID. For branch names, that hash ID is the ID of the last commit in its chain. This is true even if there are still-later commits in the chain, which happens normally when developing new stuff but not putting it on the main branch yet:
...--G--H   <-- main
         \
          I--J   <-- feature
Here, the feature branch has two more commits than the main branch. Commit J points back to I, which points back to H, which points back to G, and so on. So commits H and earlier are on both branches. Commits I and J are only on feature for now, but if we like, we can "slide the name main forward":
...--G--H--I--J <-- main, feature
and now all commits are on both branches.
Branch names move about, and each name by definition picks out the last commit that's to be considered "on that branch". The commits themselves determine what's earlier on those branches. So it's the commits that matter: the names just let us find particular ones. And, remember, all commits are frozen for all time. No part of any existing commit can ever change.
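If you'd like to poke at this yourself, the following is a small sketch in a throwaway repository. The repository location, file name, and identity settings are made up for the demonstration; the Git commands themselves are ordinary ones:

```shell
# Build a scratch repository with two commits on "main".
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" on any Git version
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo one > file.txt && git add file.txt && git commit -q -m "first"
first=$(git rev-parse HEAD)
echo two > file.txt && git add file.txt && git commit -q -m "second"

tip=$(git rev-parse main)       # the one hash ID stored in the branch name
parent=$(git rev-parse main^)   # the parent hash ID stored inside that commit
git cat-file -p "$tip"          # shows the tree, parent, author, committer, and message
```

Here the parent of the tip commit is the first commit: the name main finds the last commit, and that commit's own metadata finds the one before it.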
As we noted above, the files inside a commit are in a format in which only Git can use them. Even then, Git can only read them. We need other programs to be able to read and write to our files. The solution is simple—and is the same as that used by other version control systems: Git copies the files out of the commit at some point. The copies, once out of the version control system, are now useful. In fact, they are just plain ordinary files: anything on the computer can use them. Git no longer has any control over them.
The normal, everyday way to get Git to copy the files out of some commit is to use git checkout. For instance, if we have:
...--G--H   <-- main
         \
          I--J   <-- feature
and we run git checkout main, Git will copy all the committed files out of commit H. This also has the side effect of selecting the name main as our current branch. Since the name main points to commit H, this means that H is our current commit. We can draw this by attaching the special name HEAD to the name main:
...--G--H   <-- main (HEAD)
         \
          I--J   <-- feature
Note that we now have two copies of every file: there's the committed one in H, which we can't touch, and there's an ordinary-form everyday file in what Git calls our working tree or work-tree.
In other version control systems, these two copies of the file are the only copies you'd find.[1] If you want to know what's going on, you compare the work-tree version of some file to the active committed version: whatever is different is what we've changed. But for whatever reason (whether or not you think this is a good idea[2]), Git stores a third copy of each file[3] in something that Git calls, variously, the index, or the staging area, or (rarely these days) the cache.
This third copy of each file sits between the read-only committed copy and the work-tree copy. Unlike the committed copy, it can be overwritten. It's pre-compressed and pre-de-duplicated, so that it's ready to go into the next commit. In fact, this is probably the best general way to think about Git's index / staging-area: it holds your proposed next commit.[4]
So, when you git checkout some commit, like commit H, Git:
- copies that commit's files into its index;
- copies those files into your working tree; and
- attaches the special name HEAD to the branch name, assuming you used a branch name like main to select the commit. (If not, you get into "detached HEAD" mode, which we won't address here.)

If you now make changes to your working tree copies of files, you generally must also run git add: this tells Git to make the index copy match the working tree copy. For files that you updated in place, this overwrites the old index copy with a new one. For files that you removed, this removes the index copy. For new files, this creates a new file in Git's index.
Either way, adding files stages the changes, because any time you run git commit, Git will make its new snapshot from whatever is in the index right then. If you have not altered the index, the new snapshot would exactly match the current snapshot. In this case, Git generally requires that you use the --allow-empty flag: the new commit is not actually empty, it's just that it matches the old one snapshot-wise (so Git wonders: why bother? and makes you use the flag).
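A quick experiment makes the "commit comes from the index" rule concrete. This is only a sketch in a scratch repository; the file name and contents are invented for the demonstration:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo v1 > file.txt && git add file.txt && git commit -q -m "v1"

echo v2 > file.txt && git add file.txt     # index copy is now v2
echo v3 > file.txt                         # work-tree copy is v3; index still holds v2
git commit -q -m "snapshot taken from the index"

committed=$(git show HEAD:file.txt)        # what the new commit stored
worktree=$(cat file.txt)                   # what is sitting in the work-tree
echo "committed=$committed worktree=$worktree"   # prints "committed=v2 worktree=v3"
```

The commit holds v2, the version that was in the index, even though the work-tree file had already moved on to v3.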
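Whether or not you make any changes to your work-tree and/or run git add to update Git's index from your work-tree, the current commit remains unchanged. Once you do make a new commit, Git:
- builds the new snapshot from whatever is in its index;
- records the current commit as the new commit's parent; and
- writes the new commit's hash ID into the current branch name, so that the name now selects the new commit.

We end up with, e.g.: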
          K   <-- main (HEAD)
         /
...--G--H
         \
          I--J   <-- feature
and now there is a commit on main that is not on feature.
[1] The other read-only copies in the non-current commits would also be findable, as they are in Git, but they're not active the way the current commit's are.
[2] Other systems don't have an index, proving that it's possible to work without one.
[3] This "copy" is pre-de-duplicated, so most of the time, it takes almost no space. Calling it a copy is thus slightly misleading. However, unlike many of the other bits of Git that show through to the user, the fact that this "copy" is automatically de-duplicated is really well-hidden. You can just think of this as a third copy of each file, and it all just works. Well, until you start fiddling with internal commands like git ls-files --stage and git update-index: then you need to learn about git hash-object.
[4] The index gets expanded during a conflicted merge, which means that this description is incomplete, but it's at least not wrong. :-) The index also has a role in making Git go fast, which is why it has the old name cache. You mostly see this name in option flags these days, like git rm --cached.
Let's say that between commit H and commit I we remove a file. Let's say further that we put it on a new branch X:
git checkout main
git checkout -b X
git rm somefile
git commit -m 'remove a file'
Commit H has a file named somefile and commit I lacks a file named somefile.
When we git checkout main, file somefile has to come back. Git copies it from commit H to Git's index and our work-tree, and now we have the file.
When we git checkout X to move back to commit I, file somefile has to go away. Git removes it from Git's index and from our work-tree.
This property is determined by the set of files in the two commits. I would say entirely, but if you experiment a bit, you'll see that Git's removal of file somefile is conditional:
git checkout main # file somefile comes back
git rm --cached somefile # take somefile out of Git's index
Because we use git rm --cached here, Git removes somefile from its index, but does not touch our work-tree copy. If we now run:
git checkout X
(remember, commit I, selected by branch name X, lacks the file somefile), Git doesn't remove somefile from our working tree. The reason is that after git rm --cached, file somefile is untracked.
An untracked file, in Git, is simply a file that is in your working tree right now, but not in Git's index right now. That is the entire definition, but it has a lot of consequences, including the fact that git commit would not include the untracked file in the new commit, and including the lack of removal that we just saw.
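Here is the whole experiment as one script, run in a scratch repository so nothing real is at risk. It is a sketch: the identity settings and file contents are placeholders, and the tracked_case / untracked_case variables exist only to record what happened:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" on any Git version
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo data > somefile
git add somefile && git commit -q -m "add somefile"       # plays the role of commit H
git checkout -q -b X
git rm -q somefile && git commit -q -m "remove a file"    # plays the role of commit I

git checkout -q main          # somefile comes back
git checkout -q X             # somefile is tracked, so Git removes it
if [ -e somefile ]; then tracked_case=kept; else tracked_case=removed; fi

git checkout -q main          # somefile comes back again
git rm -q --cached somefile   # now untracked: in the work-tree, not in the index
git checkout -q X             # Git leaves the untracked file alone
if [ -e somefile ]; then untracked_case=kept; else untracked_case=removed; fi

echo "tracked: $tracked_case, untracked: $untracked_case"
# prints "tracked: removed, untracked: kept"
```

The two commits are identical in both runs; the only difference is whether somefile was in the index at the moment of the checkout.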
Because your working tree is yours, you can create and destroy files in it whenever you like.
Because Git's index is Git's, Git can add files to it or remove files from it, but we know when it will do which thing:
- When you git add a file, Git adds or removes the index copy based on what that file looks like in your working tree.
- When you git checkout a commit, Git adds or removes files to/from the index based on whether those files are in the other commit.
- When you run git rm --cached, Git removes files from Git's index as instructed.

Other cases not covered here include how git merge manipulates Git's index, how git reset and git restore work, and so on.
So, to some extent, you control which files are in Git's index—but they tend to mirror commits.
Git is a little bit ambivalent as to whether the index and working tree are included in a repository. Specifically, git init --bare makes a repository that has no working tree, but such a repository still has an index. (It probably shouldn't, but it does.) There is also the git worktree add command, since Git 2.5, which adds a pair (a working tree and an index) to a repository. So there can be multiple index-and-work-tree sets in any given repository.
It's clear enough, though, that git clone does not copy the index and work-tree of any existing repository (regardless of how many exist in that repository). So the index, or all indices, and work-tree(s) are private to each clone. You can't control any other repository's index and work-tree directly: you have to leave that to whoever might be operating Git on the other machine (assuming the other clone is on another machine).
.gitignore
The .gitignore file is misnamed. A better name would be .git-do-not-complain-about-these-files-if-they-are-untracked-and-if-they-are-untracked-and-I-use-an-en-masse-add-command-do-not-add-them-to-the-index-either.
When we run git status, Git complains about untracked files. It gets very whiny! This is quite annoying, because with a work-tree being an ordinary directory, and the kinds of software that we use, we run programs that create lots of build artifacts in our work-trees. This leaves tons of untracked files. The git status command becomes noisy, and our productivity plummets.
To get git status to shut the ____ up, we can list these expected build products in a .gitignore file. This has no effect on whether those files are in the index right now. But if they're not in the index (if they are untracked right now) then git status won't complain about them.
Of course, if git status doesn't complain, it would be really nice if git add . also "worked right", by not adding them. So that's the second main effect of listing a file in .gitignore: if the file isn't already tracked (if it's not in the index now) and we run git add ., we want Git to not add it.
If the file is already in the index (is tracked), listing it in .gitignore has no effect on git status and git add: the file's status will be checked, and the git add will add the file.[5] So for already-tracked files, .gitignore is no help. That's why the file's name isn't really right. But a more correct name would be unusable, so .gitignore it is.
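The difference between tracked and untracked files under the same ignore rule is easy to check. The following is a sketch in a scratch repository; the file names tracked.log and untracked.log are invented for the demonstration:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "line 1" > tracked.log
git add tracked.log && git commit -q -m "track a log file"

echo "*.log" > .gitignore
git add .gitignore && git commit -q -m "ignore log files"

echo "line 2" >> tracked.log     # tracked: still checked, despite matching *.log
echo "noise" > untracked.log     # untracked and ignored: git status stays quiet

status=$(git status --porcelain)
echo "$status"                   # prints " M tracked.log" and nothing else
```

The ignore rule silences git status about untracked.log, but tracked.log is in the index, so its modification is reported as usual.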
Listing a file in .gitignore has one more side effect: it gives Git permission to clobber the file. This mostly involves checking out an old commit that does contain the file, when the file is being untracked-and-ignored. The checkout proceeds, and now you have the tracked file out, with the untracked one's data having been literally destroyed. So the real full name might be .git-about-some-files-that-may-be-untracked-and-what-to-do-if-they-are:do-not-complain-and-do-not-auto-add-but-do-feel-free-to-destroy-these-files, or something. (But colon characters are disallowed on many Microsoft systems.)
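The clobbering can be reproduced in a scratch repository. This sketch assumes Git's default checkout behavior (it does not pass --no-overwrite-ignore); the file name out.txt and the branch name old are made up for the demonstration:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" on any Git version
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "tracked version" > out.txt
git add out.txt && git commit -q -m "commit out.txt"
git branch old                             # remember the commit that contains out.txt

git rm -q --cached out.txt                 # stop tracking it, keep the work-tree file
echo "out.txt" > .gitignore
git add .gitignore && git commit -q -m "stop tracking out.txt"

echo "local data" > out.txt                # untracked and ignored, with new contents
git checkout -q old                        # succeeds without complaint
content=$(cat out.txt)
echo "$content"                            # the "local data" version is gone
```

Without the .gitignore entry, the same checkout would stop and warn that an untracked file would be overwritten. Listing the file is what removes that safety check.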
[5] There are some index flags (part of the cache aspect of the index) that one can abuse to prevent git status from looking, and git add from adding. These are the assume-unchanged and skip-worktree flags. They are not designed for this purpose, hence the abuse notion above, and they don't help with the particular problem to which this is an answer, but they're worth mentioning.
You have several options. The most drastic, but easiest to enforce and perhaps easiest to do, is: make a whole new Git repository. Be careful never to add these files, so that they never become tracked and therefore never become a problem. Move all your systems to the new Git repository, abandoning (and eventually destroying) the old Git repository.
Alternatively, you can do a minimal update: make new commits that, compared to the old commits, remove the files. Then, go around to every deployment and update those systems by hand, carefully preserving the files while making them untracked. You can use the git rm --cached trick, or save the files outside the working tree during the checkout, or whatever else you like. Any of these methods works. Then be very careful never to return to the poisonous commits that will make those files tracked.
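As a rough sketch of the "save the files outside the working tree" variant, the script below simulates both machines locally: origin stands in for the shared repository and deploy for the other server. All paths and the file name settings.local are hypothetical; on a real deployment only the last few commands would be run, in the existing clone:

```shell
base=$(mktemp -d)
git init -q "$base/origin" && cd "$base/origin"
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" on any Git version
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "site secret" > settings.local
git add settings.local && git commit -q -m "oops: tracked local settings"

git clone -q "$base/origin" "$base/deploy" # the "other server" gets the tracked file

# Upstream: stop tracking the file and ignore it from now on.
git rm -q --cached settings.local
echo "settings.local" > .gitignore
git add .gitignore && git commit -q -m "untrack settings.local"

# Deployment: save the file, pull (which deletes it), then put it back.
cd "$base/deploy"
cp settings.local "$base/settings.local.saved"
git pull -q --ff-only
cp "$base/settings.local.saved" settings.local

restored=$(cat settings.local)
status=$(git status --porcelain)           # empty: the file is now untracked and ignored
```

After this one-time step the file is untracked on the deployment, so later pulls leave it alone, as long as nobody checks out one of the old commits that still contains it.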
In between these two options, you can use a history rewriting tool (filter-branch, filter-repo, The BFG, whatever you like) to take your existing commits and turn them into new-and-improved commits in which those files have never been committed. This is a lot like both of the other options: you still have to go, carefully, to each deployment and update it, because the rewritten repository is, in effect, a new repository. It has the downside that someone with the old (pre-rewrite) repository can easily accidentally re-introduce the bad commits, if the histories sync up. (Whether they do depends on what's in the first few commits ever made and/or how you do the history rewrite.)
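If you have full control of the software, the best option is usually this:
- Commit a default configuration in a file named, e.g., config.defaults.
- Have the software also read site-specific settings from a second file named, e.g., config.site.
- Make sure config.site does not appear in any past, present, or future commit. Do list it in .gitignore so that it won't accidentally get added and committed.
- Have each deployment create and maintain its own config.site file.

You can't change the past, but there's no need to do that.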
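For what it's worth, here is one way the config.defaults / config.site split might look for software configured through shell variables. Everything in it is an assumption made for illustration: the variable names PORT and LOG_LEVEL, the load_config helper, and the idea that the config files are plain shell assignments:

```shell
# Hypothetical layout: config.defaults is committed;
# config.site is untracked, ignored, and created per deployment.
dir=$(mktemp -d) && cd "$dir"
printf 'PORT=8080\nLOG_LEVEL=info\n' > config.defaults   # tracked defaults
printf 'PORT=9090\n' > config.site                       # this deployment's override

load_config() {
    . ./config.defaults
    # Site settings are optional and override the defaults when present.
    if [ -f ./config.site ]; then
        . ./config.site
    fi
}

load_config
echo "PORT=$PORT LOG_LEVEL=$LOG_LEVEL"   # prints "PORT=9090 LOG_LEVEL=info"
```

Because only config.defaults is ever committed, no checkout or pull has any reason to touch config.site.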
Upvotes: 6