Reputation: 87

How can I delete commits that do not belong to any branch?

If I merge branch A into branch B and then delete A, which branch do commits from branch A (now deleted) belong to? When I open the link to one of these commits, I see: "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository."


I tried all the answers to this question and they didn't solve my problem: Listing and deleting Git commits that are under no branch (dangling?)

What is the solution?

Upvotes: 4

Answers (2)

fastkiller

Reputation: 21

I just encountered the identical issue on GitHub. My issue was that I discovered some remnants in my search after using git filter-branch to delete sensitive data from GitHub. But after I contacted GitHub support, the issue was resolved in 5 minutes.

Upvotes: 1

torek

Reputation: 489748

TL;DR

You can't "delete" this commit. You don't have this commit in the first place, and even if you did, you still wouldn't really be able to delete it.

Long

If I merge branch A into branch B and then delete A, which branch do commits from branch A (now deleted) belong to?

The answer you might want here—and it's not wrong, but it's not right either—is "branch B". Unfortunately, there's a fundamental error in this question. I believe this error itself comes from GitHub's rather misleading claim about a commit not "belong[ing] to any branch on this repository":

⚠️ This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

The mistake in the question itself—and the reason the text above is misleading—is that commits do not owe their existence to the presence of branch names, in Git. In Git, you can have as many commits as you like, and no branch names at all. Commits never "belong to" any branch in the first place.

Instead, a key notion we use with Git is that of reachability. If some commit C_i is reachable from some other commit C_j, in Git repository R, this means that C_i is an ancestor of C_j (or equivalently, C_i ≺ C_j, where "≺", a sort of bendy less-than sign, is read as "precedes"): this defines a partial order on the commit graph, which is a Directed Acyclic Graph, or DAG.¹

We then define branch—or at least branch name—in Git as a reference (or ref) whose name begins with refs/heads/ and whose hash ID is constrained to be that of a commit, with ref itself defined as a name containing a hash ID.² Hence a name like refs/heads/branch is a branch name, and the hash ID stored in this branch name must be that of some commit.
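
As a rough illustration of that definition, here is a minimal Python sketch. The ref table, the tag name v1.0 and the helper is_branch_name are all made up for this example, and single letters stand in for real hash IDs; this is not how Git stores refs internally.

```python
# Hypothetical ref table: each ref name holds exactly one hash ID.
# Single letters stand in for real hash IDs.
refs = {
    "refs/heads/main": "H",
    "refs/heads/branch": "H",
    "refs/tags/v1.0": "G",             # a tag name, not a branch name
    "refs/remotes/origin/main": "H",   # a remote-tracking name
}


def is_branch_name(ref_name):
    """A branch name is a ref whose full name starts with refs/heads/."""
    return ref_name.startswith("refs/heads/")


branch_names = sorted(name for name in refs if is_branch_name(name))
print(branch_names)
```

Only the two refs/heads/ names qualify; the tag and the remote-tracking name are refs, but not branch names.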

A commit reaches all its ancestors. Each commit stores a list—usually just one entry long—of previous commit hash IDs. These form commits into chains, with backwards-pointing arrows. Simple cases have just one backwards arrow coming out of each commit, pointing to its predecessor:

A <-B <-C ... <-F <-G <-H

Here, in our simple repository R, we have exactly eight commits. Instead of using Git's actual commit hash IDs, we've given them single uppercase letters. (This scheme is impractical in a real repository: what would we do if there were more than 26 commits? But it's useful for thinking about the issues here.) The last commit we made, H, stores inside itself the hash ID of the second-to-last commit G. We say that H points to G. G stores F's hash ID, so we say that G points to F. This continues, backwards, down the entire chain of commits until we hit commit A. Because it's the very first commit, it can't point backwards, and it doesn't: its list of parent hash IDs is empty.
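
To make "a commit reaches all its ancestors" concrete, here is a small Python sketch of this eight-commit chain. It assumes a toy model in which each commit is just a letter mapped to its list of parents; the function name reachable is invented for the example, and the walk is a plain graph traversal, not Git's actual code.

```python
# Toy model of the eight-commit chain: commit -> list of parent commits.
parents = {
    "A": [],          # the root commit has no parents
    "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["E"], "G": ["F"], "H": ["G"],
}


def reachable(parents, start):
    """Return the set of commits found by following parent links from start."""
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


print(sorted(reachable(parents, "H")))  # every commit in the chain
print(sorted(reachable(parents, "A")))  # only the root itself
```

Starting from H we can walk back to every commit; starting from G we can never get to H, because there is no forward link.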

¹This particular definition is slightly backwards, because Git itself works backwards. In a normal DAG, reachability would imply successorship, rather than predecessorship. But in Git, all the arrows point backwards, instead of forwards.

²Most refs are spelled refs/*, but there exist pseudo-refs, such as HEAD and CHERRY_PICK_HEAD, that are not. Pseudo-refs are special cases that make things troublesome for the folks working on putting in a proper ref database for Git. Note that pseudo-refs are per-work-tree, but some other refs, such as the bisection refs, are also per-work-tree.

Reachability from a branch name implies that a commit is on the branch

We start with our simple eight-commit repository ending with:

...--G--H   <-- main (HEAD)

We've added the branch name main and stored in main the real hash ID of commit H. So we say that main points to H, the same way that H points to G. (For text / ASCII-art purposes on Stack Overflow I've failed to draw the arrow from H to G as an arrow: we just have to remember that commits only link backwards. There's no link from G to H, only vice versa.)

This setup means that the name main allows us to reach any of the eight commits in the repository. Let's now add two more branch names, br1 and br2, both of which point to commit H:

...--G--H   <-- br1, br2, main (HEAD)

All three names point to commit H. So all eight commits are reachable from all names. This means that all commits are on all branches.

The HEAD attached to main here means that the branch name we're using is main, and the current commit is therefore commit H. Let's run git checkout or git switch now, to change which name HEAD is attached to:

git switch br1

This results in:

...--G--H   <-- br1 (HEAD), br2, main

The only thing that changed at this point is that HEAD is now attached to br1. All eight commits are still there; we're still using commit H; but now we're using H via the name br1.

Now we make a new commit, in the usual way. This new commit gets a new, unique hash ID, but we just call it "commit I" to keep our sanity. To draw it in, we need to draw an arrow from I pointing back to H, and make the name br1 point to I, because that's how Git actually handles this internally:

          I   <-- br1 (HEAD)
         /
...--G--H   <-- br2, main

We're now using commit I, through name br1.

If we add another new commit J, we get:

          I--J   <-- br1 (HEAD)
         /
...--G--H   <-- br2, main

We're now using commit J through name br1. Now we switch to br2:

git switch br2

          I--J   <-- br1
         /
...--G--H   <-- br2 (HEAD), main

We're now using commit H again, through name br2. If we make two more commits, we get:

          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2 (HEAD)

Et voilà, we have "branches"! Commits I-J can be said to "belong to" branch br1 and commits K-L can be said to "belong to" branch br2, but what about the commits up through H? Some would say these "belong to" main, but Git makes no such distinction: they're "on" all three branches. When we first made the two br* branch names, all the commits were "on" all three branches, and those commits still are on all three branches. It's just that new commits I-J are only on br1, and new commits K-L are only on br2, at the moment.
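
The "on a branch" idea can be checked mechanically. The Python sketch below assumes the same toy model as the drawings (letters for hash IDs, a dict of parent lists, with G standing in for all older history); branches_containing is a hypothetical helper that asks, for each branch name, whether the commit is reachable from that name's tip.

```python
parents = {
    "G": [],              # older history omitted from this toy graph
    "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}
branches = {"main": "H", "br1": "J", "br2": "L"}


def reachable(parents, start):
    """Return the set of commits found by following parent links from start."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def branches_containing(commit):
    """List every branch name whose tip can reach the given commit."""
    return sorted(name for name, tip in branches.items()
                  if commit in reachable(parents, tip))


for commit in "HIJKL":
    print(commit, branches_containing(commit))
```

Commit H comes out as being on all three branches, while I-J are only on br1 and K-L only on br2.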

When we use git merge, we're not really merging branches. We're really merging commits. Let's do that now:

git switch br1
git merge br2

The git switch makes commit J the current commit, by attaching HEAD to br1. The git merge has Git locate not one, not two, but three commits:

the current / HEAD commit, J;
the commit we name on the command line: br2 points to L; and
the merge base commit.

The merge base is defined through the Lowest Common Ancestor algorithm, whose inputs are commits J and L, and this algorithm coughs up the hash ID of commit H.

The merge itself works by comparing the stored snapshots in H, J, and L. This allows Git to figure out "what we did" on the H-I-J chain, and "what they did" on the H-K-L chain. (Note that commits I and K are used only for their linkage here, not for their snapshots: both link back to commit H, which caused commit H to be the merge base.)
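
Here is one way to sketch the merge-base computation in Python. It is a deliberately naive version: collect the common ancestors of the two commits, then discard any that is an ancestor of another common ancestor. The function names are invented and Git's real implementation is considerably more efficient; the graph is the same toy model, with G standing in for older history.

```python
parents = {
    "G": [],              # older history omitted from this toy graph
    "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}


def reachable(parents, start):
    """Return the set of commits found by following parent links from start."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another one."""
    common = reachable(parents, a) & reachable(parents, b)
    return {c for c in common
            if not any(c != d and c in reachable(parents, d) for d in common)}


print(merge_bases(parents, "J", "L"))
```

Both G and H are common ancestors of J and L, but G is itself an ancestor of H, so only H survives as the merge base.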

If all goes well, Git makes the new merge commit on its own. This new merge commit M has not one but two parents (two backwards-pointing arrows), linking to both commits J and L, like this:

          I--J
         /    \
...--G--H      M
         \    /
          K--L

I've temporarily taken all the branch names away from the drawing, because we don't need them: commits exist independently of any branch names. But making a new commit in Git always does the same thing:

when we made commit I, Git wrote the new commit's hash ID into the then-current branch name br1;
when we made commit J, Git wrote the new commit's hash ID into the then-current branch name br1; and
when we made commits K and L, Git wrote each new commit's hash ID into the then-current branch name br2;

so now that we made M, Git writes M's hash ID into the now-current branch name br1:

           I--J
          /    \
 ...--G--H      M   <-- br1 (HEAD)
          \    /
           K--L

Names main and br2 still exist, and still point to H and L. There's no room to draw in main, in this ASCII art, and there's no need to draw in br2 right now. We can instead ask: Which commits are reachable from the name br1? The answer is: All of them!

Commits K-L were only on br2 before, but now, because of merge commit M, commits K-L are on two branches. So that gets us an answer to your original question, as long as we rephrase it slightly: after a true merge, deleting a branch name is "safe" because the commits are still findable via the merge commit. They're now "on" both branches, and taking away one name—the name we're not using right now, br2 in this case—still leaves at least one other name that they're "on".
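
The effect of the merge commit on reachability can be sketched in Python as well. As before, this is a toy model with letters for hash IDs; the only new thing is that M has two parents. The helper branches_containing is hypothetical, and deleting a dictionary entry stands in for deleting the branch name.

```python
parents = {
    "G": [],              # older history omitted from this toy graph
    "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
    "M": ["J", "L"],      # the merge commit has two parents
}
branches = {"main": "H", "br1": "M", "br2": "L"}


def reachable(parents, start):
    """Return the set of commits found by following parent links from start."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def branches_containing(commit, branches):
    """List every branch name whose tip can reach the given commit."""
    return sorted(name for name, tip in branches.items()
                  if commit in reachable(parents, tip))


before = branches_containing("K", branches)
del branches["br2"]                      # stands in for deleting the name br2
after = branches_containing("K", branches)
print(before, after)
```

Before the deletion K is on br1 and br2; afterwards it is still on br1, which is why deleting the name loses nothing here.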

Caveat: not all merges are true merges

While the git merge command sometimes makes a merge commit M, as it did when we started from:

          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2

we can come up with other situations where it doesn't:

...--P--Q   <-- br3 (HEAD)
         \
          R   <-- br4

Here, git merge br4 will do a fast-forward operation instead of a merge, producing:

...--P--Q--R   <-- br3 (HEAD), br4

In the case of a fast-forward, deleting br4 is still safe: the commit that used to be only "on" br4, commit R, is now "on" br3 too.
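
A fast-forward can be sketched as "if the current tip is an ancestor of the other tip, just move the name". The Python below assumes the toy letters-and-parents model (with P standing in for older history); is_ancestor and try_fast_forward are invented names, and the sketch ignores everything else git merge checks, such as the already-up-to-date case or --no-ff.

```python
parents = {"P": [], "Q": ["P"], "R": ["Q"]}   # P stands in for older history
branches = {"br3": "Q", "br4": "R"}


def is_ancestor(parents, older, newer):
    """True if older can be reached by walking parent links from newer."""
    seen, stack = set(), [newer]
    while stack:
        commit = stack.pop()
        if commit == older:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return False


def try_fast_forward(branches, current, other):
    """Move the current name forward if no real merge is needed."""
    if is_ancestor(parents, branches[current], branches[other]):
        branches[current] = branches[other]   # no new commit is made
        return True
    return False


moved = try_fast_forward(branches, "br3", "br4")
print(moved, branches)
```

After the move both names hold R, so removing br4 still leaves R reachable from br3.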

But we can also run git merge --squash, and that particular option directs git merge to make a non-merge "squash" commit:

          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

[we now run:
    git merge --squash br2
and a second Git command that we're forced to run, to get:]

          I--J--S   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

New commit S here, after the git merge --squash, has the same snapshot we'd get if we had git merge make a true merge. That is, Git still went through all the normal "find the merge base, run two diffs, combine work" steps that it would do for a true merge. But then git merge stops and makes us run git commit,³ and when we do, git commit makes an ordinary non-merge commit, which I drew above as S.
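
The difference from a true merge shows up as soon as we ask what is still reachable. In this Python sketch (same toy model, invented helper names), S has a single parent J, so nothing links br1 to K or L; deleting the name br2 leaves those two commits findable from no branch at all.

```python
parents = {
    "G": [],              # older history omitted from this toy graph
    "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
    "S": ["J"],           # squash commit: one parent only, no link to L
}
branches = {"main": "H", "br1": "S", "br2": "L"}


def reachable(parents, start):
    """Return the set of commits found by following parent links from start."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def unreachable_from(parents, refs):
    """Commits that no name in refs can reach."""
    live = set()
    for tip in refs.values():
        live |= reachable(parents, tip)
    return set(parents) - live


before = unreachable_from(parents, branches)
del branches["br2"]                      # stands in for deleting the name br2
after = unreachable_from(parents, branches)
print(sorted(before), sorted(after))
```

With br2 present nothing is unreachable; once it is gone, K and L are.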

³There's no good reason for this. If we want this action, we can run git merge --squash --no-commit. This combination is allowed! It does the exact same thing as git merge --squash today. But in the distant past, the --squash option was handled as a special case of --no-commit, so that it did both things, and that means that it now has to keep doing both things in the name of backwards compatibility.

Unfindable commits and garbage collection

In general, in Git, we—and even Git itself—find commits using names. They do not have to be branch names, but they very typically are branch names, or in clones, remote-tracking names (origin/* for instance). Regardless of the kind of name—branch name, tag name, remote-tracking name, internal bisection reference, or whatever it might be—the name holds one hash ID. If that's a commit hash ID, it suffices to find all predecessor commits, through the graph reachability algorithms.

But sometimes we might have commits that can only be found by one ref, such as the one branch name br2:

          I--J--S   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

If we delete this one ref br2, how do we find commits K-L?

One answer is: we don't. (Another—but only temporary—answer is to use Git's reflogs, which semi-secretly hold on to commit hash IDs for a while. Eventually the reflog entries expire, though, and then we're back to the "we don't" answer.)

If we, and Git, cannot find a commit, that commit becomes eligible for "garbage collection" under git gc.⁴ Git will run git gc for you, automatically, at irregular and Git-determined times. This git gc will—slowly and painfully, by crawling through the entire repository R—find any commits and other Git objects that are unreachable and, if several other conditions are met,⁵ actually remove the objects from the repository objects database.

This gc system is quite clever. It allows Git programs to generate internal objects freely whenever they're useful, then simply abandon them when they have no use any more. The garbage collector / janitorial service will come along later and clean up.
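
The general shape of such a collector is mark-and-sweep: mark everything reachable from a name, then sweep what is left. The Python sketch below is only a toy under several assumptions: ages are given in days, the 14-day grace period mirrors the default two-week window, recently created commits are treated as extra starting points so that nothing they point to is removed, and reflogs are not modeled at all. The names collect_garbage, age_days and GRACE_DAYS are invented, and commit X is a hypothetical just-created commit.

```python
GRACE_DAYS = 14   # assumed grace period, mirroring the two-week default

parents = {
    "G": [], "H": ["G"], "I": ["H"], "J": ["I"], "S": ["J"],
    "K": ["H"], "L": ["K"],   # abandoned when the name br2 was deleted
    "X": ["S"],               # just created, not yet attached to any name
}
age_days = {"G": 90, "H": 80, "I": 40, "J": 39, "S": 35,
            "K": 38, "L": 37, "X": 0}
refs = {"refs/heads/br1": "S", "refs/heads/main": "H"}


def reachable(parents, start):
    """Return the set of commits found by following parent links from start."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def collect_garbage(parents, age_days, refs):
    """Remove commits that no ref and no recent commit can reach."""
    roots = set(refs.values())
    roots |= {c for c in parents if age_days[c] <= GRACE_DAYS}
    live = set()
    for root in roots:                    # mark phase
        live |= reachable(parents, root)
    doomed = set(parents) - live
    for commit in doomed:                 # sweep phase
        del parents[commit]
        del age_days[commit]
    return doomed


removed = collect_garbage(parents, age_days, refs)
print(sorted(removed))
```

K and L are old and unreachable, so they go; X is just as unreachable but too young to remove.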

⁴git gc is part of general Git maintenance-and-housekeeping, and there is ongoing work now on a git maintenance command that will handle this in a more generalized, predictable, and usable fashion for server setups. It's possible that git maintenance may eventually be useful to ordinary users as well as Git administrators, but there is much more to be done here first.

⁵The most important one is that the object itself be sufficiently old. Since git gc can be running "in the background" at any time, it's important that it not delete an object that exists because some command—say, git commit—has just created the object, just now, but not yet gotten around to hooking it up to be visible. If git gc garbage-collected a fresh commit just before git commit could write its hash ID into a branch name, that would be bad. So everything gets, by default, a two-week window to finish up whatever it's doing. Two weeks is probably enough for git commit to finish writing out a new commit. 😀

(Kidding aside, Git's operation is so much faster than the version control systems we used in the old days. I'd best stop here, lest this turn into the Monty Python Four Yorkshiremen sketch.)

So what is this claim from GitHub about, then?

When GitHub say:

⚠️ This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

what do they mean?

The direct meaning is this commit is not reachable from a branch name in this repository. Both the phrase I just quoted and my more-precise replacement phrase have two this adjectives, both functioning as determiners: a specific commit—presumably one you have displayed in a browser—and a specific repository by which their Git found the commit.

We just said that we usually find a commit using a name. But in fact, we find the underlying commit object, in the repository's object database, using its hash ID. The hash ID is the "true name" of the commit. What we found using a name was the hash ID, not the commit object itself. If we have the hash ID in hand, that's all we need—and when we look at a GitHub repository commit using a browser, we supply the commit hash ID. For instance, the URL https://github.com/git/git/commit/5a73c6bdc717127c2da99f57bc630c4efd8aed02 ends with 5a73c6bdc7.... That's a commit hash ID. So GitHub can access the commit without using a branch name.

Now, this particular commit—5a73c6bdc7...—is the most recent master commit, at the time I write this, so if GitHub look at the branch names in this repository, they immediately see that 5a73c6bdc7... is the tip commit of master. If, by the time you read this, the GitHub refs/heads/master name locates some other commit, it's easy for the GitHub software to see if 5a73c6bdc7... is an ancestor of whatever the tip commit of master is then, and if so, 5a73c6bdc7... is still reachable from master, and hence still "on" branch master.

If we pick some other commit in some other repository, though, perhaps that commit isn't reachable from any branch name. If so, that satisfies the first part of the clause in the quote:

⚠️ This commit does not belong to any branch on this repository

and we could stop there, or speculate that perhaps git gc will eventually remove this commit. (A git gc won't remove the commit if it's findable by some other name, such as a tag name. You can have commits that can be found only via the tag name, not any branch name. Whether GitHub will produce a warning like this for such commits is up to GitHub.)

But they go on to add this:

and may belong to a fork outside of the repository.

This is GitHub-specific. Forks are not part of Git: they're a GitHub add-on. (This particular add-on is found on other hosting sites as well, but GitHub were there first, as far as I know. Bitbucket and GitLab appear to have modeled their forks on GitHub's.)

A fork, on GitHub, is a server-side clone with added features. These added features include the ability to raise Pull Requests (which are another add-on feature from GitHub). To make these Pull Requests work, GitHub internally make use of some tricks that Git has implemented for decades (at least since Git v1.0.0 in 2005-ish). One of these tricks is that Git can look in other repositories' object databases to retrieve Git objects. This means that if you have some repository R_you on GitHub, someone else can have a different repository R_se (se stands for Someone Else) that they forked from your R_you. They can make commits of their own and send them to R_se ... and then, under whatever conditions might apply later, you can use a URL that embeds their commit hash ID under your repository's name and, due to this sort of alternates trick, see their commit, even though it's in their fork.⁷

The upshot of all of this is that you can view a commit that's in their repository, that they've raised as a Pull Request to you, as if it were in your repository. When you do this, you will definitely trigger the same "does not belong to any branch on this repository" condition. That will produce the warning you see here:

⚠️ This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

In this particular case, the commit truly is not in your repository R_you over on GitHub. So there's no way to delete it from R_you. It isn't in R_you, it's in R_se. You can just see it from R_you.

You can't tell, from the warning, which underlying condition triggered the warning. All you know is that the commit you are viewing now is not reachable from any of the branch names in R_you. That could be because it is reachable, but not from a branch name; it could be because it isn't reachable, and is waiting to be GC-ed; or it could be because it's in someone else's repository.
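
The three possibilities can be told apart if we have access to the repository itself, which the warning page does not give us. The Python sketch below is a toy classification under the usual assumptions (letters for hash IDs, a dict of parent lists); the tag name refs/tags/old, the commit letters T and Z, and the function classify are all made up, and this says nothing about how GitHub actually decides when to show the warning.

```python
# Toy version of R_you: T is reachable only from a tag, K-L from no name.
parents = {"G": [], "H": ["G"], "T": ["H"], "K": ["H"], "L": ["K"]}
refs = {"refs/heads/main": "H", "refs/tags/old": "T"}


def reachable(parents, start):
    """Return the set of commits found by following parent links from start."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def classify(commit, parents, refs):
    """Say which of the cases in the text applies to a commit."""
    if commit not in parents:
        return "not in this repository"
    branch_tips = [tip for name, tip in refs.items()
                   if name.startswith("refs/heads/")]
    if any(commit in reachable(parents, tip) for tip in branch_tips):
        return "on a branch"
    if any(commit in reachable(parents, tip) for tip in refs.values()):
        return "reachable, but not from a branch name"
    return "unreachable"


for commit in ["H", "T", "L", "Z"]:      # Z lives only in someone else's fork
    print(commit, "->", classify(commit, parents, refs))
```

Everything except "on a branch" would be consistent with the warning.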

In all three cases, you can't delete the commit itself directly. In one case, git gc might delete it on its own, but you can't make GitHub run git gc.⁸ In one case—if you have a tag for the commit, for instance—there may be something you can do that would then enable git gc to delete it on its own. And in the final case, it's not yours to delete, even if you could get git gc to do it.

⁷The same sort of rules might apply to your commits as well: if they know the hash ID, they may be able to see those commits in their fork. This has obvious security implications, and I don't know what GitHub may have done about these. GitHub have a lot of very competent programmers and they may have made this all quite secure, so that you can only see their commits if they have raised a PR to you, and they can only see your commits if they're public. I am merely pointing out that at the low level, careless use of "alternates" introduces various security issues, so be careful if you use this.

⁸GitHub support can run git gc for you, but you must contact them to get the process started. In that sense, you can make them run git gc, but it's kind of indirect.

Upvotes: 11
