user2501886

Git: 'Permanent deletion' (of a branch) without rebase or filter-branch

Motivation: I have a somewhat specific scenario for which Git seems to be a good fit, but which is unusual enough to require some extra work. It is a set of text files (no code) that are updated automatically at least once every 10 seconds or so. The changes can be considerable, and over time the repository becomes relatively large. The local repository is on an embedded system without constant network connectivity, so the natural workflow is to collect commits locally, push them when there is an opportunity, then delete what was just pushed to free up space if necessary. It may be useful to keep the history on the device temporarily, but most importantly it must be possible to eliminate it from the device. (It is kept forever on the remote.) Depending on a few application-specific factors the scenario may be extended somewhat, and we may use additional Git features, but the basic structure I outlined should stay the same.

More specifically, there is one local copy of the repository and one remote, and the local only ever pushes (a particular branch) to the remote (never pulls). The commit graph is simple, a 'straight line' of commits one after the other with no merging or parallel lines. Whenever there is an opportunity to push (as discussed above), a new branch will be created for further commits. So every so often we have a new branch, which basically just functions to organize the timeline of commits. Aside from this, we never switch branches.
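The push-and-rotate cycle described above can be sketched roughly as follows. This is only an illustration under assumptions: a throwaway bare repository stands in for the real remote, and the directory names, the file `data.txt` and the commit messages are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway directories standing in for the remote and the device (hypothetical).
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/device"
cd "$work/device"
git config user.email "device@example.invalid"
git config user.name "device"
git remote add origin "$work/remote.git"

# The permanent initial commit, followed by one automatic update.
echo initial > data.txt
git add data.txt
git commit -q -m "initial"
echo update >> data.txt
git commit -q -am "automatic update"

# When connectivity is available: push the current branch ...
to_push=$(git symbolic-ref --short HEAD)
git push -q origin "$to_push"

# ... then start a new, timestamp-named branch for further commits.
date=$(date +"%m-%d-%y--%H-%M-%S")
git checkout -q -b "$date"

# Collect a few facts about the result.
remote_tip=$(git --git-dir="$work/remote.git" rev-parse "refs/heads/$to_push")
local_tip=$(git rev-parse HEAD)
current=$(git symbolic-ref --short HEAD)
echo "pushed $to_push at $remote_tip, now on $current"
```

After this runs, the remote holds everything that was on the pushed branch, and the new branch starts at the same commit.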

So, the old branches can be removed, and as discussed this is our goal especially when space becomes a concern. To 'permanently delete' the commits and the branch, we tried the following:

# name the new branch after the current time, e.g. 05-07-16--15-50-56
date=$(date +"%m-%d-%y--%H-%M-%S")

git $opt checkout -b "$date"

git $opt branch -d "$to_push"

# the first commit will be the single 'initial' commit in the master
# branch, which is permanent and never 'deleted'
git $opt replace --graft \
    "$(git $opt log -n 1 --pretty="%H")" \
    "$(git $opt rev-list --max-parents=0 HEAD)"

git $opt reflog expire --expire=now --all
git $opt gc --aggressive --prune=now
git $opt repack -a -d -l

The opt variable just specifies the work tree and git dir. The graft (with the subsequent gc etc.) successfully eliminates the commits from a naive git log, and it does free some space, but it does not seem to free the space occupied 'by the diffs which are still being held in the commits'. For example, a large file that is created, committed, and then deleted continues to occupy space after its commits are eliminated in this fashion. We won't have any particularly large files in practice, but I assume this behavior is general: the 'data from the changes' (diffs?) is still being kept in the repository, or something like that, and that is what we care about eliminating.

I managed to whittle the remaining structure down with some tricks that were suggested to me, such as removing the branches from the 'fetch' glob in the config and running git fetch --prune origin, and running git update-ref -d refs/remotes/origin/05-07-16--15-48-59, for example, but this did not free the space in question. The following output describes the current state of the repository:

$ git log --all --oneline --graph --decorate
* de345b6 (HEAD -> 05-07-16--15-50-56, replaced) sam. mai  7 15:44:16 EDT 2016
| * 50272b5 sam. mai  7 15:44:16 EDT 2016
|/  
| * 0b96272 sam. mai  7 15:29:48 EDT 2016
|/  
| * b764118 sam. mai  7 15:28:13 EDT 2016
|/  
| * efa0536 sam. mai  7 15:14:45 EDT 2016
|/  
| * 40c8806 sam. mai  7 15:13:57 EDT 2016
|/  
| * 6f7c2f9 sam. mai  7 15:12:26 EDT 2016
|/  
| * fa33771 sam. mai  7 15:11:21 EDT 2016
|/  
| * 8698acd sam. mai  7 15:11:08 EDT 2016
|/  
* b2d9486 (origin/master, master) initial
$ git show-ref
de345b670e24ac68bbbf4aa7efd22598ef3c7251 refs/heads/05-07-16--15-50-56
b2d9486d5d427d1ae4bb88828f334454a2fb6954 refs/heads/master
b2d9486d5d427d1ae4bb88828f334454a2fb6954 refs/remotes/origin/master
0b96272e47cab0b29e2706cae83b8154f8e412ea refs/replace/0afdaca4e6d071fc026d209249a7b0532c11122a
b7641184c898ff08917d363435d5f45e5e9664ed refs/replace/498f8846c6a742f96997b599f5e25f5ad20b568c
6f7c2f9b7700b39b4fd837c34ab7911a08d5438a refs/replace/4df4f9cf8cc01500c800f3f04cbbd655a866c9ba
8698acd667d406fab764389b87518d133de887a6 refs/replace/9a91b7248da808a9fc6e1531c4206a6865273005
40c880617db664cb73390d90e1401a049bc8c303 refs/replace/9edc1e243f4f36034a800c566fdeeac511e077a3
efa0536a40e68d92751193fa0c6dec502d77ce72 refs/replace/d6256dbe48a10461e17ca3cf7e7c40700937d249
fa3377117750fd81c703519038268fec89b65dce refs/replace/db9923391013d8e5d2974f328037f6315af85783
50272b55f66b8d7c55305a3502db8e9f88b2db03 refs/replace/de345b670e24ac68bbbf4aa7efd22598ef3c7251

Regarding the criteria mentioned in the subject, we do not want to do a rebase or filter-branch because the data in the work tree is live and being updated frequently, as discussed. I suppose we could copy the work tree somewhere else then perform the deletion there, but that exacerbates the space constraint even more. And even if we did copy it elsewhere and successfully deleted old data with rebase or filter-branch, we'd need to rsync any new changes in the live repository over to the copied one and copy the copied one back into the live one, all atomically with respect to the processes which are actively reading from and/or modifying the contents of the repository, which seems like unnecessary hassle but we are open to it.

Another suggestion we were given was to use format-patch and am to 'serialize' the commits and reconstruct the structure on the remote repository after transferring it in the form of text file patches. Then we could just create a new repository on the local to get rid of the old data. But this also sounds unnecessarily complex, and basically seems like re-doing the work that git is designed to do. We are open to this possibility (or the possibility of switching to another VCS for this, or something custom), but it seems like we are tantalizingly close to getting this to work, and git seems to fit our use case pretty well otherwise.

I can provide more details, and I can also recreate the repository and try different steps and/or show command output at various steps in the process. Thanks for your time.

Edit

After the suggestion of Vampire, and his request for additional information:

$ git rev-list --all | xargs -l $git describe --all --always            
replace/de345b670e24ac68bbbf4aa7efd22598ef3c7251
replace/0afdaca4e6d071fc026d209249a7b0532c11122a
replace/498f8846c6a742f96997b599f5e25f5ad20b568c
replace/d6256dbe48a10461e17ca3cf7e7c40700937d249
replace/9edc1e243f4f36034a800c566fdeeac511e077a3
replace/4df4f9cf8cc01500c800f3f04cbbd655a866c9ba
replace/db9923391013d8e5d2974f328037f6315af85783
replace/9a91b7248da808a9fc6e1531c4206a6865273005
heads/05-07-16--15-50-56

Upvotes: 0

Views: 164

Answers (1)

Vampire

Reputation: 38734

Your problem is that you use git replace.
git replace makes Git pretend that one commit is really another commit or, as in your case, that a commit has a different parent than it actually does.
But the original objects are still there. They are only logically replaced for most Git commands, not physically replaced, unless you rewrite them with a rebase, filter-branch or similar.
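To see this for yourself, here is a small sketch in a throwaway repository (the file name and commit messages are made up for illustration). It grafts the tip onto the root commit as in the question, expires the reflog and runs gc, then compares what Git shows with replacements enabled and disabled.

```shell
#!/bin/sh
set -eu

# Throwaway repository (hypothetical data, not the asker's repository).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "device@example.invalid"
git config user.name "device"

# Four commits in a straight line.
for i in 1 2 3 4; do
    echo "state $i" > data.txt
    git add data.txt
    git commit -q -m "commit $i"
done
root=$(git rev-list --max-parents=0 HEAD)
middle=$(git rev-parse HEAD~2)

# Graft the tip directly onto the root commit, as in the question.
git replace --graft "$(git rev-parse HEAD)" "$root"

git reflog expire --expire=now --all
git gc -q --prune=now

# History as most commands see it (replacement applied) ...
visible=$(git rev-list --count HEAD)
# ... and the history that is really stored.
real=$(git --no-replace-objects rev-list --count HEAD)
# A 'hidden' commit is still in the object database.
if git cat-file -e "$middle" 2>/dev/null; then
    middle_present=yes
else
    middle_present=no
fi
echo "visible: $visible, real: $real, middle commit present: $middle_present"
```

The branch still points at the real tip commit, and reachability traversal during pruning and packing ignores replace refs, so the whole original chain stays in the repository.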

But if I understood you correctly, what you are really after is simply the following:

git reset --soft <initial commit>
git commit -m "recording current state as the only commit after the initial commit"

followed by the reflog expiry, gc and repack you already run, to wipe out the now-unreferenced objects.
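Put together, the whole sequence might look like the sketch below, run against a throwaway repository with made-up data. Two steps are my own assumptions rather than something you necessarily need: deleting leftover refs/replace/* refs from the earlier grafts (they are ordinary refs and keep their target commits alive), and removing ORIG_HEAD defensively after the reset.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for the device's repository (hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "device@example.invalid"
git config user.name "device"

# The permanent 'initial' commit.
echo initial > data.txt
git add data.txt
git commit -q -m "initial"
root=$(git rev-list --max-parents=0 HEAD)

# History to discard: a large file is committed, then removed again.
head -c 200000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big file"
big_blob=$(git rev-parse HEAD:big.bin)
git rm -q big.bin
echo update >> data.txt
git add data.txt
git commit -q -m "remove big file, update data"

# A graft left over from the earlier approach.
git replace --graft "$(git rev-parse HEAD)" "$root"

# Assumption: drop leftover replace refs so they stop pinning old objects.
git for-each-ref --format='%(refname)' refs/replace |
while read -r ref; do
    git update-ref -d "$ref"
done

# The two commands from above: keep the current state, discard the history.
git reset -q --soft "$root"
git commit -q -m "recording current state as the only commit after the initial commit"

# Assumption: remove ORIG_HEAD (set by git reset) so nothing names the old tip.
git update-ref -d ORIG_HEAD

# Expire the reflog and prune everything that is now unreachable.
git reflog expire --expire=now --all
git gc -q --prune=now

commit_count=$(git rev-list --count HEAD)
if git cat-file -e "$big_blob" 2>/dev/null; then
    big_blob_present=yes
else
    big_blob_present=no
fi
echo "commits: $commit_count, big blob present: $big_blob_present"
```

Since reset --soft touches neither the index nor the work tree, the live files are left alone while the history is collapsed.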

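You can also put those two commands inside a single git alias so they are invoked as one step, although as far as I know that is only a convenience and does not make them atomic with respect to the processes writing to the work tree.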

Upvotes: 1
