Tags: git, visual-studio-code, git-commit, source-control-explorer, unstage

Reputation: 14610

After a git reset --soft HEAD^ was performed, how do I remove files not needed by my repo?

I had to run

git reset --soft HEAD^

to undo a commit with large files (same issue). Now I can see my files again in the VS Code Source Control Explorer (see below).

Problem - I want to remove these files from being added to my repo when committing and then pushing, so I added

/.angular/cache

to my .gitignore file, but that didn't remove the files from the Source Control window.

Question - Do I need to do something else to remove these files from Source Control, e.g., unstage each file individually?

Source Control in VS Code:

Upvotes: 1

Answers (2)

torek

Reputation: 490058

TL;DR Summary

As chepner suggested in a comment, you probably really wanted a --mixed reset, not a --soft reset. However, as j6t added, you can recover from this error by using git rm --cached -rf .angular/cache (be sure to use the --cached to avoid removing the working tree copies).

You will still want to create or update your .gitignore so that you can't accidentally add the .angular/cache contents again. You should git add the file after updating it (or creating it with appropriate initial contents).

Had you used --mixed (or the default which is --mixed), you might have had to add some other files besides the .gitignore, but you could update the .gitignore file first, then use a standard "add everything" (git add .) to add everything except the current untracked-and-ignored files. This tends to be easier to get right, which is why it's the recommended method. But adding everything, then un-adding (git rm --cached -rf) the unwanted files, also works. It's just klunky and easy to get wrong.
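Here is a minimal sketch of that recovery in a throwaway repository. The file names (app.js, data.bin), the commit messages, and the identity settings are made up for illustration; only the git reset --soft, git rm --cached -rf, and .gitignore steps come from the discussion above.

```shell
set -e
# Scratch repository (all names below are hypothetical).
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# First commit: an ordinary project file.
echo 'console.log("v1")' > app.js
git add app.js
git commit -q -m "initial commit"

# Second commit: accidentally includes the cache directory.
mkdir -p .angular/cache
echo "big blob" > .angular/cache/data.bin
echo 'console.log("v2")' > app.js
git add .
git commit -q -m "oops: includes cache"

# Undo the commit; index and working tree still hold everything.
git reset -q --soft HEAD^

# Recovery: drop the cache from the index only, then ignore it.
git rm -q --cached -rf .angular/cache
echo "/.angular/cache" >> .gitignore
git add .gitignore

# What the next commit would contain, compared to HEAD.
staged=$(git diff --cached --name-only)
echo "$staged"
```

The cache files stay on disk; they are just no longer part of the proposed next commit.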

Necessary background (long)

Git is all about commits. Git is not about files, although Git commits hold files, and Git is not about branches, although Git branch names help you (and Git) find the commits. As such, you need to know what a Git commit is and does for you, and how git commit makes a new commit. So let's touch lightly on commits first.

A Git commit is ...

A Git repository is, at its heart, just two databases. One database holds commits and other supporting objects, and a separate database holds names: branch names, tag names, and other names that help Git find the commits. It's the commits-and-other-objects database that really matters: you can use a repository in which the names database is completely empty (it's just extremely awkward and unpleasant to do that), but you can't use a repository in which the objects database is empty.

Ignoring the supporting objects here, we'll just talk about the commit objects, since those are the ones you interact with. All the objects, including the commits, are numbered, but their numbers are big ugly random-looking things, such as f01e51a7cfd75131b7266131b1f7540ce0a8e5c1. The commit hash IDs are always totally unique. So if you have this commit (the one that starts with f01e51a7cf, that I linked to the GitHub copy of it), it's literally this commit, and you must have a clone of the Git repository for Git. The need for these numbers to be totally unique to each commit is what makes them so big; it also makes them unusable by humans, but the computer is good at them, so the computer uses them. We use branch names instead, as we'll see in a moment.

The numbering system requires that nothing ever be changed once it gets stored. So the big all-objects-database you find in a Git repository is completely read-only; all you ever do is add to it.¹

Besides being numbered, each commit:

Holds a full snapshot of every file, in a compressed and (important for Git's internal operation and for your disk space needs) de-duplicated fashion. That is, when you make a new commit, if you have a million files but have only changed three of them, the new commit doesn't duplicate all 1 million files: it re-uses all but the new three.

These snapshots are like tar or WinZip archives, in a way, in that you can just have Git extract all the files from them any time you like. But they're not ordinary files: they're special Git-only compressed-and-de-duplicated things, that only Git can read, and nothing—not even Git itself—can overwrite. That's why they are safe to share across multiple commits.
Holds some metadata, such as the name and email address of the person who made the commit. Your log message goes in here too, when you make a commit. Crucially for Git's internal workings, Git adds, to this metadata, a list of previous commit hash IDs, which Git calls the parents of this commit.

Most commits just have a single parent. That is, most commits list one previous commit hash ID. This results in a simple backwards-looking chain. Suppose H stands for the hash ID of your newest commit. We'll draw this with an arrow coming out of it, indicating that H points to its parent commit (by storing the hash ID of the parent):

<-H

We'll call the parent commit hash ID G for short, and draw in commit G:

        <-G <-H

Of course, G is a commit too, so it has a list of parents; it's a typical commit, so it has just one parent, which we'll call F, and we'll draw in F:

... <-F <-G <-H

F is a commit, with a parent, so it points back still further, and so on. By following this chain backwards, one commit at a time, Git can find every commit in the chain, all the way back to the very first commit ever. That commit (presumably A in our simple example here) has an empty list of parent commit hash IDs, so that it doesn't point backwards at all, and that lets git log stop going backwards.

So that's how git log shows you the history. The history is nothing but the commits. The commits are the history; git log starts wherever you are now (usually at the latest) and works backwards, one commit at a time.
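If you want to see the backwards-pointing chain for yourself, something like the following sketch should work in a scratch repository. The three commits named A, B, and C and the file name are invented for the example.

```shell
set -e
# Scratch repository with a three-commit chain (names are hypothetical).
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

for name in A B C; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "commit $name"
done

# A commit's metadata: tree, parent, author, committer, log message.
git cat-file -p HEAD
parent_of_head=$(git rev-parse HEAD^)

# The root commit has no "parent" line, which is where the walk stops.
root=$(git rev-list --max-parents=0 HEAD)
root_parents=$(git cat-file -p "$root" | grep -c '^parent ' || true)

# git log starts at HEAD and follows the parent links backwards.
subjects=$(git log --format=%s)
echo "$subjects"
```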

There's just one nasty little problem right now, and that is: to use this, we'd have to memorize at least one Git commit hash ID. How are we going to find commit H? How will we know it's the latest commit? This is where the branch names come in.

¹Git will occasionally stick "junk" in here, and will clean it out on its own later. Calling it append-only is therefore a bit wrong technically. But unless you have truly enormous files (petabytes at a time) or are on ridiculously tiny storage quotas, you don't normally have to worry about this.

Branch names help you find commits

Let's draw our little diagram without bothering with the arrows this time (out of laziness):

...--G--H

We need a way to quickly find the random-looking hash ID H. We'll have Git store that hash ID in a branch name, like main or master:

...--G--H   <-- main

That is, the branch name main will hold the raw hash ID of commit H. We won't have to remember it ourselves: we'll have Git do that job, by having Git store the hash ID of H under the name main.

If we create another new branch name:

...--G--H   <-- develop, main

then, right now, both names point to the same commit. But this is about to change, so now we need to know which of these names we're actually using to find commit H. Let's say we use git switch develop or git checkout develop, so that we're using the name develop, not the name main, to find commit H; we'll draw that like this:

...--G--H   <-- develop (HEAD), main
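The picture above can be checked from the command line. This is only a sketch in a scratch repository; the single commit stands in for H, and the identity settings are placeholders.

```shell
set -e
# Scratch repository with one commit standing in for H.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > README.txt
git add README.txt
git commit -q -m "commit H"

git branch develop        # a second name for the same commit
git checkout -q develop   # attach HEAD to develop (git switch develop in newer Git)

main_id=$(git rev-parse main)               # hash ID stored under the name main
develop_id=$(git rev-parse develop)         # hash ID stored under the name develop
current=$(git symbolic-ref --short HEAD)    # the name HEAD is attached to
echo "main=$main_id develop=$develop_id HEAD->$current"
```

Both names resolve to the same hash ID; only the name that HEAD is attached to differs.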

New commits, part 1: What happens with branch names

Without (yet) explaining how Git goes about making the snapshot for a new commit, let's say we now make a new commit, which gets a new, totally-unique, big ugly random-looking hash ID, which we'll just call I so that we don't have to guess it.

Commit I will store a single snapshot of all files, plus some metadata. In the metadata for I, Git will add our name and email address, our log message, the current date-and-time, and—so that history works—Git will set the (single) parent of new commit I to be existing commit H.

Git knows to use that hash ID because the name develop, to which HEAD is attached, currently points to commit H. That is, we're on develop, and develop means "commit H", so new commit I should point back to existing commit H. Git writes out the new commit metadata-and-snapshot and now we have:

          I
         /
...--G--H

Now Git does its sneaky trick. The name main pointed to H before, and still does. But Git, having allocated a new hash ID to new commit I, makes the current name point to I now:

          I   <-- develop (HEAD)
         /
...--G--H   <-- main

So now if we use the name develop, we get commit I, and if we use the name main, we get back to commit H. If we make another new commit, we'll have develop pointing to the newest such commit, which will point backwards to now-existing commit I, which will continue (forever) to point backwards to existing commit H, and so on:

          I--J   <-- develop (HEAD)
         /
...--G--H   <-- main

Note that commits up through and including H are on both branches, in Git's reckoning.
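A small sketch of this behavior, again in a scratch repository with invented file names: two commits made on develop move only develop, and H remains reachable from both names.

```shell
set -e
# Scratch repository: H on main, then I and J on develop.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo H > h.txt && git add h.txt && git commit -q -m "commit H"
h=$(git rev-parse HEAD)

git checkout -q -b develop
echo I > i.txt && git add i.txt && git commit -q -m "commit I"
echo J > j.txt && git add j.txt && git commit -q -m "commit J"

main_id=$(git rev-parse main)             # main did not move
two_back=$(git rev-parse develop~2)       # two steps back from J is H

# H is an ancestor of (or equal to) the tip of each branch.
git merge-base --is-ancestor "$h" main
git merge-base --is-ancestor "$h" develop
echo "main=$main_id develop~2=$two_back"
```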

New commits, part 2: Where does the snapshot come from?

We noted above, at the beginning of this, that all Git commits are permanent (well, mostly²) and read-only (completely). Moreover, nothing can write to the files in a commit, and only Git itself can even read those files. So how are we ever supposed to get any work done?

The fact that the snapshot in a commit is like an archive gets us the first part of the answer. To check out a commit (with git checkout or git switch), Git will extract the archive. That is, Git will de-Git-ize and de-compress the data for each file and store it in an ordinary file: one the computer can read and write as usual. All the programs on your computer can deal with these files, as they're literally ordinary files.

These files go into what Git calls your working tree or work-tree. It's literally where you do your work. You don't work on/with the files that are in Git. You work on files that aren't in Git, that are instead extracted to your work area. Almost all version control systems (VCSes) work this way, for the simple reason that the VCS-ized saved files are in some internal format.

If Git were like most other version control systems, we'd stop here, with the two copies of each file from the current commit: one stored forever inside the commit, and one usable one. You'd work on the usable files and then use the "make new commit" action and Git would make the new commit from the updated files.

Git isn't like this. Instead, Git has another trick up its sleeves.³ Instead of keeping two copies of each file—the committed one, and the working one—Git keeps three copies of each file. Or rather, three "copies": in between the committed copy and the working copy, Git keeps an extra "copy", stored in the compressed-and-de-duplicated format, but not read-only. Because it's de-duplicated, this copy is initially shared with the committed copy. The de-duplication is invisible though, so we don't have to worry about it: we can just think of it as a third copy.

In other words, instead of just:

HEAD commit                  working tree
-----------                  ------------
Makefile                     Makefile
README.txt                   README.txt
main.py                      main.py

we have a third copy of each file, in what Git calls—in Git-y fashion—by three different names: the index, or the staging area, or (rarely these days) the cache. All three names refer to the same thing, and I'll use the name index here, but staging area is closer to the way you mostly use it, so feel free to use that name in your head:

HEAD commit      index       working tree
-----------    ----------    ------------
Makefile       Makefile      Makefile
README.txt     README.txt    README.txt
main.py        main.py       main.py

When you change the working tree copy, nothing happens to the index copy. You must run git add regularly; the git add command means make the index copy match the working tree copy. Git will, at git add time, read the working tree copy, compress it into the Git format, check for duplicates, and then:

if it's a duplicate: toss out the index copy (if there is one: if it's a new file name, there isn't an index copy here) and put in the duplicate;
if it's not a duplicate: toss out the index copy (if there is one) and put in the compressed form

and now either way, the index copy of the named file matches the working tree copy (and is pre-de-duplicated).

This means that the index copy is, at all times, ready to go into the next commit. Thus, what's in the staging area is what will go into the next commit. It is, in effect, the proposed next commit. You edit files in your working tree just to edit them, and then you use git add to update your proposed next commit.
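To watch the three copies diverge, you can try a sketch like this one in a scratch repository. The file contents are placeholders; git diff --cached compares the index against HEAD, plain git diff compares the working tree against the index, and the :path syntax reads the index copy directly.

```shell
set -e
# Scratch repository with the three example files.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

for f in Makefile README.txt main.py; do echo "v1" > "$f"; done
git add .
git commit -q -m "version 1 of everything"

# Edit two files in the working tree, but git add only one of them.
echo "v2" > main.py
echo "v2" > README.txt
git add main.py

staged=$(git diff --cached --name-only)   # index differs from HEAD
unstaged=$(git diff --name-only)          # working tree differs from index
index_readme=$(git show :README.txt)      # index copy of README.txt
index_main=$(git show :main.py)           # index copy of main.py

echo "staged: $staged"
echo "not staged: $unstaged"
echo "index README.txt: $index_readme, index main.py: $index_main"
```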

²A commit that you can't find will eventually go away for real. We'll see that in a bit when we talk more about git reset.

³What kind of shirt or whatever does Git wear anyway?

New commits: summary

Since the index holds, at all times, the proposed next commit, all git commit has to do is:

gather all the metadata needed, including the current commit hash ID;
write out whatever is in Git's index right now as the new snapshot: it's pre-de-duplicated and ready to go;
store the new commit, obtaining the new unique hash ID; and
update the current branch name so that the new commit is the last commit on that branch.
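These four steps can be imitated by hand with Git's lower-level commands. This is a sketch for illustration only, not a claim about how git commit is implemented internally; the file name and messages are invented.

```shell
set -e
# Scratch repository with one ordinary commit.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo one > file.txt && git add file.txt && git commit -q -m "first"
old=$(git rev-parse HEAD)

# Propose a new snapshot by updating the index.
echo two > file.txt
git add file.txt

# 1. Gather metadata: here, just the current commit's hash ID.
parent=$(git rev-parse HEAD)
# 2. Write out whatever is in the index as a snapshot (a tree object).
tree=$(git write-tree)
# 3. Store the new commit and obtain its hash ID.
new=$(git commit-tree "$tree" -p "$parent" -m "second, made by hand")
# 4. Make the current branch name point to the new commit.
git update-ref "$(git symbolic-ref HEAD)" "$new"

git log --format='%h %s'
```

The working tree never enters into it: the snapshot comes from the index alone.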

Let's use our sample repository, which at this point looks like this:

          I--J   <-- develop (HEAD)
         /
...--G--H   <-- main

and watch the action as we git switch main or git checkout main, make a new branch name, switch to that new name, and then make a new commit:

git checkout main or git switch main: this
- removes (from index and working tree) the current commit's files (the files from J);
- extracts the main-commit's files (from H) into index and working tree; and
- leaves us with this:
```
          I--J   develop
         /
...--G--H   <-- main (HEAD)
```
git checkout -b feature or git switch -c feature: this
- creates a new branch name feature pointing to H;
- switches to it: this would involve removing files from H and installing files from H, but Git sees that that's pointless and skips it;
- leaves us with this:
```
          I--J   develop
         /
...--G--H   <-- feature (HEAD), main
```
We now modify some files in the working tree. Nothing happens to Git's index yet, but then we run git add on those files, and now the versions in the index of the add-ed files match.

If we like, we can create new files from scratch, and add those. Or we can completely remove a file entirely, with git rm: that removes it from both the working tree and the index.
Now we run git commit. Git packages up whatever is in the index right now and makes a new commit that updates the current branch name, i.e., feature, so we end up with:
```
          I--J   develop
         /
...--G--H   <-- main
         \
          K   <-- feature (HEAD)
```

Note how git commit simply adds on to the drawing. No existing commit changes. If commit H has some files in it that we removed when we made K, that just means that commit K lacks those files. They're still there in commit H. It's the commits that matter. The commits hold the files. Find the commit, check it out, and you'll get the files.

`git reset`

With all the above in mind, we can now understand what git reset does—or at least, what git reset --soft, git reset --mixed, and git reset --hard do. The git reset command is very big inside and can do a lot of other things too, if you want it to; we are only going to cover the basic three here.

Suppose we have made another commit:

          I--J   develop
         /
...--G--H   <-- main
         \
          K--L   <-- feature (HEAD)

and we suddenly realize that commit L was terrible for some reason: wrong snapshot, bad commit message, whatever. We have several options, but the easiest one is to use git commit --amend. This command is a lie: it doesn't change commit L, it just makes a new commit L' that has commit K as its parent:

          I--J   develop
         /
...--G--H   <-- main
         \
          K--L'  <-- feature (HEAD)
           \
            L

Commit L still exists. We just can't find it any more because we use the names, not the hash IDs. The name feature now finds the "amended" commit L', not the original L. But we won't talk here about using git commit --amend; instead, we'll talk about using git reset.
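You can confirm that the original commit survives an amend by saving its hash ID first. A sketch, with made-up file names and messages:

```shell
set -e
# Scratch repository with commits K and L.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo K > k.txt && git add k.txt && git commit -q -m "commit K"
echo L > l.txt && git add l.txt && git commit -q -m "commit L (bad message)"

old=$(git rev-parse HEAD)                    # remember L's hash ID
git commit -q --amend -m "commit L (fixed)"  # makes L', does not touch L
new=$(git rev-parse HEAD)

old_type=$(git cat-file -t "$old")   # L is still in the object database
old_parent=$(git rev-parse "$old^")  # L's parent is K
new_parent=$(git rev-parse "$new^")  # L' also has K as its parent
echo "old=$old new=$new type-of-old=$old_type"
```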

The git reset command works by letting us move the current branch name. We can pick any commit, and make the name feature point to that commit. For instance, we could pick commit G if we wanted to. But let's pick commit K, using HEAD^ or HEAD~ to find it.⁴ Any of our three git reset commands, given the hash ID of commit K or a name that finds the hash ID of commit K, will do this:

          I--J   develop
         /
...--G--H   <-- main
         \
          K  <-- feature (HEAD)
           \
            L

Commit L still exists, but the name feature now points to commit K:

If we use git reset --soft HEAD^, Git moves the branch name, and then stops: the index and working tree are still from commit L.
If we use git reset --mixed HEAD^ or git reset HEAD^ (the default --mixed), Git moves the branch name, yanks all the commit-L files out of the index, and inserts into the index all the commit-K files. Then reset stops here.
If we use git reset --hard HEAD^, Git moves the branch name and yanks all the commit-L files out of the index and our working tree, and installs into the index and our working tree the commit-K files.

So this kind of git reset can do up to three things:

move the branch;
reset the index; and
reset the working tree too.

The flags tell it when to stop: --soft says to do step 1 and stop. The default is to do steps 1 and 2 and stop. The --hard flag tells it to do all three steps.
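The following sketch runs each of the three kinds of reset against a fresh two-commit scratch repository and reports what HEAD, the index, and the working tree hold afterwards. The helper name make_repo and the file name are hypothetical.

```shell
set -e
# Hypothetical helper: a scratch repository with commits K and L,
# where file.txt contains the name of the commit it came from.
make_repo() {
  dir=$(mktemp -d)
  cd "$dir"
  git init -q
  git symbolic-ref HEAD refs/heads/main
  git config user.name "Example User"
  git config user.email "user@example.com"
  echo K > file.txt && git add file.txt && git commit -q -m K
  echo L > file.txt && git add file.txt && git commit -q -m L
}

results=""
for mode in soft mixed hard; do
  make_repo
  git reset -q "--$mode" HEAD^
  head_msg=$(git log -1 --format=%s)   # step 1: where the branch points
  index_copy=$(git show :file.txt)     # step 2: what the index holds
  work_copy=$(cat file.txt)            # step 3: what the working tree holds
  echo "$mode: HEAD=$head_msg index=$index_copy worktree=$work_copy"
  results="$results$mode:$head_msg/$index_copy/$work_copy "
done
```

In every case HEAD ends up at K; with --soft the index and working tree still hold L's content, with --mixed only the working tree does, and with --hard nothing does.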

If we like, we can git reset --hard HEAD. That tells Git:

move the branch: find the current commit, to which the branch points now, and move the branch to point to the current commit;
reset the index; and
reset the working tree.

Because the commit we picked in step 1 is the commit the name already points to, the "move the branch" part was a no-op. The name didn't actually move anywhere. We used this git reset for its steps 2 and 3. It still did step 1, it just didn't achieve anything by doing step 1.

We can use git reset HEAD to make Git do nothing during step 1 and then reset the index, without touching the working tree. Note that if we leave out the commit hash ID—if we run git reset or git reset --hard—we get a mixed or hard reset that, in step 1, doesn't move the branch. But we're always doing step 1, even if it's just a big nothing.

⁴This syntax—the suffix ^ or ~—is part of a whole series of ways Git has of specifying commits. Since the commit is the raison d'être of Git, there should be a lot of ways to name a commit, and there are. See the gitrevisions documentation for a complete list of ways to name Git internal objects (mainly commits, but you can name the others as well).

Why the default `git reset` (i.e., mixed) is what you wanted

Using:

git reset HEAD^

you would have:

moved the branch name back one step, and
reset Git's index so that it holds the files from the selected commit; but
not reset the working tree (because --mixed suppresses step 3).

You could now git add each of the files you want new or updated in the index, and not git add any files you didn't want updated in the index. In other words, you could now do the same thing you normally do, all the time, with Git.

By using git reset --soft HEAD^, you did step 1, but not step 2. So that meant you then had to adjust Git's index to not contain the files you didn't want to commit. That's also something you will do now and then in Git, but it's less common than git add-ing files that you do want to commit. It's not harmful to do it "backwards", it's just easier to get wrong.
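For comparison, here is a sketch of the mixed-reset route in a scratch repository: reset, update .gitignore first, then use an ordinary git add . to stage everything else. The file names and commit messages are invented for the example.

```shell
set -e
# Scratch repository (all names below are hypothetical).
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo 'console.log("v1")' > app.js
git add app.js
git commit -q -m "initial commit"

# The commit we regret: it includes the cache directory.
mkdir -p .angular/cache
echo "big blob" > .angular/cache/data.bin
echo 'console.log("v2")' > app.js
git add .
git commit -q -m "oops: includes cache"

# Default (mixed) reset: branch moves back, index is reset,
# working tree is left alone.
git reset -q HEAD^

# Ignore the cache first, then add everything that is not ignored.
echo "/.angular/cache" >> .gitignore
git add .
git commit -q -m "v2 without the cache"

tracked=$(git ls-files)
echo "$tracked"
```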

Addendum 1: why `git update-index --assume-unchanged` is wrong

Git always makes new commits from the files that are in Git's index. As such, the git status command has, as one of its jobs, the job of telling you about files in your working tree that don't match the copies in Git's index.

That is, suppose you've modified three files from the contents they had earlier. Then you ran git add on one of them. Let's list what's in each of the three "active" copies of each file, with a version number added. You started with:

HEAD commit      index       working tree
-----------    ----------    ------------
Makefile(1)    Makefile(1)   Makefile(1)
README.txt(1)  README.txt(1) README.txt(1)
main.py(1)     main.py(1)    main.py(1)

After modifying all three files in the working tree, you have:

HEAD commit      index       working tree
-----------    ----------    ------------
Makefile(1)    Makefile(1)   Makefile(2)
README.txt(1)  README.txt(1) README.txt(2)
main.py(1)     main.py(1)    main.py(2)

Now you run git add main.py, forgetting to add Makefile and README.txt. You get:

HEAD commit      index       working tree
-----------    ----------    ------------
Makefile(1)    Makefile(1)   Makefile(2)
README.txt(1)  README.txt(1) README.txt(2)
main.py(1)     main.py(2)    main.py(2)

One of the jobs of git status is to compare the index and working tree copies and complain if they don't match. The result is a complaint that, hey, you forgot to git add those two files.

The index copy of each file has two special flags you can set:

"assume unchanged"
"skip worktree"

These two flags have different purposes, but both of them are currently implemented the same in terms of git status: they both make git status not bother complaining about the files when they don't match.

Running git commit at this point would make a new commit in which main.py is updated, but README.txt and Makefile are not updated. In your case, the problem was that you added new .angular/cache/* files to Git's index. Setting "assume unchanged" on those requires that there be some copy of each of those files in Git's index (you cannot set these flags on files that aren't in Git's index). But you want each commit you make to lack these files entirely. You want the files to not be in Git's index.
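A quick sketch shows why the flag doesn't help here: a file marked "assume unchanged" is still in the index, so it still goes into the commit. The file name data.bin is a placeholder; git ls-files -v marks assume-unchanged entries with a lowercase letter.

```shell
set -e
# Scratch repository (file names are hypothetical).
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

mkdir -p .angular/cache
echo "big blob" > .angular/cache/data.bin
git add .angular/cache/data.bin

# Set the flag on the index entry; the entry itself stays put.
git update-index --assume-unchanged .angular/cache/data.bin
flag=$(git ls-files -v .angular/cache | cut -c1)

# The commit is built from the index, so the file is in it.
git commit -q -m "still contains the cache file"
in_commit=$(git ls-tree -r --name-only HEAD)
echo "flag=$flag"
echo "$in_commit"
```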

Addendum 2: about `.gitignore` files

Listing files in a .gitignore does not affect whether the files are in Git's index. A file that is in Git's index right now, regardless of why it's there, makes that file a tracked file. Git's git status command will (in the absence of assume-unchanged or skip-worktree) complain about files that are in Git's index and your working tree and don't match. It doesn't matter whether these files are listed in a .gitignore or not: the files are tracked, so they'll be in the next commit, so Git will complain if they don't match.

What listing files in .gitignore does do is suppress a different complaint. Suppose you have some file xyzzy in your working tree right now. (Maybe you made it by pasting something you wanted to remember into a file, with the intent of removing it as soon as you've taken care of whatever it is.) Suppose further that this file isn't in Git's index (so it won't be in your next commit) and that it shouldn't be in your next commit, or any commit. Its presence in your working tree, though, will make git status complain that xyzzy is an untracked file.

An untracked file is, by definition, any file that exists in your working tree, but not in Git's index. (Any file that is in Git's index is a tracked file.) And git status complains about these, and you can't set any index-entry flags for these because they're not in Git's index on purpose. So there needs to be a way to stop git status from complaining—and that's the first part of what a .gitignore entry does.

Listing a file name or pattern in a .gitignore tells git status, hey, shut up about these files when they're untracked, I don't want to hear it, it's on purpose. To help out with git add, listing those files also means and when they're untracked, if I use an en-masse "add all files" command like git add ., don't add them either, and that's the second part of what a .gitignore entry does.

What all this means is that .gitignore is the wrong name for the file. It should be .git-do-not-complain-about-these-files-when-they-are-untracked-and-do-not-auto-add-them-with-en-masse-git-add-commands-either. But that's a ridiculous name to type, so .gitignore it is.
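To see both halves of this in action, consider a sketch with one tracked and one untracked file that match the same ignore pattern. The file names and the *.log pattern are made up for the example.

```shell
set -e
# Scratch repository (file names are hypothetical).
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# A file that gets tracked before any ignore rule exists.
echo "first" > tracked.log
git add tracked.log
git commit -q -m "track a log file"

# Now ignore all .log files, and touch both kinds of file.
echo "*.log" > .gitignore
echo "changed" > tracked.log     # tracked: the pattern does not apply
echo "scratch" > untracked.log   # untracked: the pattern silences it

status=$(git status --porcelain)
echo "$status"

# An en-masse add skips the ignored untracked file too.
git add .
added=$(git ls-files)
echo "$added"
```

The modified tracked.log is still reported and still gets added, while untracked.log is neither reported nor added.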

Summary: what you've learned

Git stores commits. These store files and metadata.
Git finds commits by hash ID, but humans don't do that, so Git saves hash IDs in names: a branch name holds one hash ID.
Git finds commits by looking at commits, which point to earlier commits. That's history, and is also how branches work: the branch name points to whichever commit we want to say is the latest commit.
Git builds new commits from whatever is in Git's index aka staging area.
The files you work on, in your working tree, are not in Git.
You may want some files that are in your working tree never to go into a commit. To make that happen, don't add them. To make that easier, use .gitignore files, but be aware that that's not sufficient on its own: you have to make sure you haven't already added them.
The git reset command moves the current branch name and then optionally resets the index and the working tree.
Moving a branch name can cause a commit to become un-find-able (unless you've memorized its hash ID).

The last part is what makes git reset a dangerous command: if you can't find a commit, what good is it? Also, git reset --hard erases stuff from your working tree, and working tree files are not in Git. They may have come out of a commit (in which case you can get them again, from that same commit), but they may not have (e.g., if you spent all day updating them and haven't committed yet).

Upvotes: 2

Alan Deep

Reputation: 2105

Adding your files to .gitignore alone is not enough.

You should do this:

git update-index --assume-unchanged <file_path>

and add your files to .gitignore

If you want to do this to a directory, open that directory in your shell (using cd):

and execute this:

git update-index --assume-unchanged $(git ls-files | tr '\n' ' ')

Upvotes: 0