Removing unreferenced objects from remote

Question

I'm wondering whether a remote Git repository does (or should) automatically delete unreferenced file objects (and trees) once it receives a push from a local repository that was rebased to skip the commits that introduced, and later deleted, those files. Since the skipped commits are no longer in the history chain, it seems logical for the remote to delete these objects, as they are no longer part of any commit in the history. This graph may explain it:

This is the history before rebase --onto

 * b5b7c142 after-deleting offending-file
 * db759b06 deleted offending-file
 * 59a9440a added offending-file
 * 933729b1 before-adding-offending-file

which was pushed to the remote before I regretted it. But here comes the attempt to fix it:

git rebase --onto 933729b1 db759b06

which effectively reconstructs commit b5b7c142 (after-deleting offending-file) to have a different parent, 933729b1 (before-adding-offending-file), leaving the two middle commits out of the branch's history.

This is how it looks after the rebase above (note that the top commit's SHA-1 changed because its parent changed):

* 17c95f49 after-deleting offending-file
| * db759b06 deleted offending-file
| * 59a9440a added offending-file
| /
* 933729b1 before-adding-offending-file

This looks fine as local history, and the file object still exists in .git/objects because it is still part of commits that remain in the local repository. What happens if I push to the remote now? Will GitHub delete that file object from its .git/objects, since it is no longer part of any commit or tree in the history? And if not, how can I make that happen?

torek · Accepted Answer

GitHub may or may not delete the unreachable commit and file some time in the future. It's up to them.

A normal everyday Git repository—one you control, for instance—will generally drop the unreferenced commit entirely when git gc runs. For that to happen, though, first all references have to go away. Using git rebase leaves several references behind, on purpose:

There is an entry in the HEAD reflog (viewable with git reflog).
There is an entry in the branch reflog (viewable with git reflog <branch>, substituting the branch name).
There is a reference in ORIG_HEAD.
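
To see these three references for yourself, here is a minimal sketch that rebuilds a history shaped like the one in the question inside a throwaway repository and then looks at each reference. The branch name main, the file names, and the Demo identity are assumptions made up for the demo, and the commit hashes you get will differ from those in the question.

```shell
#!/bin/sh
set -e

# Throwaway repository; every name below is invented for this demo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main      # fix the branch name regardless of local defaults
git config user.name "Demo"
git config user.email "demo@example.com"

# Recreate the four commits from the question.
echo base > readme.txt
git add readme.txt
git commit -q -m "before-adding-offending-file"
base=$(git rev-parse HEAD)

echo secret > offending-file
git add offending-file
git commit -q -m "added offending-file"

git rm -q offending-file
git commit -q -m "deleted offending-file"
deleted=$(git rev-parse HEAD)

echo more >> readme.txt
git commit -q -am "after-deleting offending-file"
old_tip=$(git rev-parse HEAD)

# Replay everything after "deleted" on top of "before-adding".
git rebase -q --onto "$base" "$deleted"

# The old tip is no longer on the branch, but these still remember it:
echo "HEAD reflog:";   git reflog
echo "branch reflog:"; git reflog show main
echo "ORIG_HEAD:";     git rev-parse ORIG_HEAD
echo "old tip was:     $old_tip"
```

The old tip should show up in both reflog listings and as the value of ORIG_HEAD, which is why the skipped commits and the file object survive a git gc at this point.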

The last one will be overwritten by the next operation that saves the previous HEAD value in ORIG_HEAD. The other two will eventually be dropped through reflog entry expiration. Each reflog entry is timestamped and stays "live" until the current time exceeds the entry's timestamp plus the expiration time. Another of git gc's functions is to find and delete expired entries. The expiration time is under your control, and by default it is both 30 days and 90 days: 90 days for entries whose commit is still reachable from the current tip of the ref, and 30 days for entries whose commit is not. This part is confusing, but it is not really relevant to the GitHub variant because they don't use reflogs like this. The point is that the references have to be really gone, which takes time, and this part is true for GitHub as well.
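
In a repository you control you don't have to wait for the entries to age out. As a rough sketch, the two lifetimes are the gc.reflogExpire and gc.reflogExpireUnreachable settings, and git reflog expire can drop entries immediately. The scratch repository, file name, and identity below are made up for the demo, and git reset stands in for the rebase to keep it short.

```shell
#!/bin/sh
set -e

# Throwaway repository; names are invented for this demo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo one > f
git add f
git commit -q -m "one"
echo two > f
git commit -q -am "two"
git reset -q --hard HEAD~1      # "two" is now off the branch, as after the rebase

# Show the effective settings; when unset, Git's built-in defaults apply.
echo "gc.reflogExpire:            $(git config --get gc.reflogExpire || echo 'unset (default 90 days)')"
echo "gc.reflogExpireUnreachable: $(git config --get gc.reflogExpireUnreachable || echo 'unset (default 30 days)')"

before=$(git reflog)
echo "reflog before expiry:"
echo "$before"

# Expire every entry in every reflog right now instead of waiting.
git reflog expire --expire=now --all

after=$(git reflog)
echo "reflog after expiry:"
echo "$after"
```

This throws away the safety net that reflogs provide, so it is only worth doing when you really do want the old commits to become collectable.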

Once the references are really gone, a git gc would discard the internal objects that hold the unwanted commit and file, provided that they're not in a kept pack. Kept packs are something you have to create on your own—Git doesn't do this itself—so if you're not doing that, you personally won't encounter this.
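
Putting the pieces together for a local repository, the following sketch clears the remaining references and then runs git gc with --prune=now so that unreachable objects are dropped immediately rather than after the usual grace period. Again, the scratch repository and its contents are assumptions for the demo, with git reset standing in for the rebase.

```shell
#!/bin/sh
set -e

# Throwaway repository; names are invented for this demo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo base > readme.txt
git add readme.txt
git commit -q -m "before-adding-offending-file"

echo secret > offending-file
git add offending-file
git commit -q -m "added offending-file"
blob=$(git rev-parse HEAD:offending-file)   # object ID of the file's contents

git reset -q --hard HEAD~1                  # drop the commit from the branch

# Still present: reflogs keep the commit, and with it the blob, alive.
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi
echo "before cleanup: blob is $before"

# Remove the leftover references, then collect garbage right away.
git update-ref -d ORIG_HEAD
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi
echo "after cleanup:  blob is $after"
```

None of this touches the remote. It only shows what has to happen on the side that stores the objects before they can go away.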

The main issue you'll have with GitHub is that you don't know when they will scrub their last reference, nor when they will subsequently run a git gc that will discard the object—plus, they add special refs for pull requests, issues, and other items, which can keep objects alive indefinitely. The upshot of all of this is that you cannot predict when or even whether some file will disappear from GitHub.
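
You can imitate the server side with a local bare repository. The sketch below is only a simulation under stated assumptions: remote.git is a plain bare repository, not GitHub, and refs/pull/1/head is created by hand to mimic the kind of ref GitHub adds for a pull request. It shows that a force push alone leaves the object on the "server", and that an extra ref keeps it there even through a git gc.

```shell
#!/bin/sh
set -e

# Throwaway "server" and clone; all names are invented for this demo.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo"
git config user.email "demo@example.com"

echo base > readme.txt
git add readme.txt
git commit -q -m "before-adding-offending-file"
echo secret > offending-file
git add offending-file
git commit -q -m "added offending-file"
blob=$(git rev-parse HEAD:offending-file)

git push -q "$work/remote.git" main
# Hand-made stand-in for a pull-request ref (an assumption of this demo).
git push -q "$work/remote.git" HEAD:refs/pull/1/head

# Rewrite history locally and force-push the shortened branch.
git reset -q --hard HEAD~1
git push -q --force "$work/remote.git" main

# What the server still advertises:
git ls-remote "$work/remote.git"

# Helper: report whether the blob exists on the simulated server.
remote_has_blob() { git -C "$work/remote.git" cat-file -e "$blob" 2>/dev/null; }

git -C "$work/remote.git" gc -q --prune=now
if remote_has_blob; then with_extra_ref=present; else with_extra_ref=gone; fi
echo "gc with the extra ref in place: blob is $with_extra_ref"

# Only once the last ref is removed can gc discard the object.
git -C "$work/remote.git" update-ref -d refs/pull/1/head
git -C "$work/remote.git" gc -q --prune=now
if remote_has_blob; then without_extra_ref=present; else without_extra_ref=gone; fi
echo "gc after removing the ref:      blob is $without_extra_ref"
```

On GitHub you can run git ls-remote against your own repository to see which refs it advertises, but you cannot delete its internal refs or trigger its gc yourself, which is the unpredictable part.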

Note that you can contact GitHub support and get them to do a manual scrub. Of course, by then, any number of people could have obtained this file, so if there's any sensitive data in it, consider it to be well-known to the black-hat hacker community by now.
