rubo77

Reputation: 20865

truncate git repository while keeping regular snapshots

I want to keep track of a 500 KB JSON text file whose content changes every minute. I would like to use git, so I can run git pull on another server to download the latest version of that file without the risk that the file changes during the download; at the same time this also gives me versioning of that file for the last months/years.

I thought of creating a git repository where I commit every file change, but I noticed that after some days the repository grows to many GB (even with git gc, because so much of the file changes each time).

I could regularly truncate the git history to a particular depth, but that is not what I need. I need the information of what the file looked like a week ago, a month ago, a year ago, although I don't need as many commits the further back in the past they are.

Is this even possible with git and some bash magic? I am fine with deleting and recreating the repository and using --amend in that git.

Or would you suggest another solution?

Upvotes: 0

Views: 77

Answers (1)

Mark Adelsberger

Reputation: 45779

There is at least one way to do this; I'll outline an approach below. First a few things to think about:

Depending on the nature of the changes that occur, you might want to see if frequent packing of the database might help; git is pretty good at avoiding wasted space (for text files, at least).

Of course with the commit load you describe - 1440 commits per day, give or take? - the history will tend to grow. Still, unless the changes are dramatic on every commit, it seems like it could be made better than "many GB in a few days"; and maybe you'd reach a level where a compromise archiving strategy would become practical.

It's always worth thinking, too, about whether "all the data I need to keep" is bigger than "all the data I need regular access to"; because then you can consider whether some of the data should be preserved in archive repos, possibly on backup media of some form, rather than as part of the live repo.

And, as you allude to in your question, you might want to consider whether git is the best tool for the job. Your described usage doesn't use most of git's capabilities, nor does it exercise the features that really make git excel. Conversely, other tools might make it easier to progressively thin out the history.

But with all of that said, you still might reach the decision to start with "per minute" data, then eventually drop it to "per hour", and maybe still later reduce to "per week".

(I'd discourage defining too many levels of granularity; the most "bang for your buck" will come with discarding sub-hourly snapshots. Hour->day would be borderline, day->week would probably be wasteful. If you get down to weekly, that's surely sparse enough...)

So when some data "ages out", what to do? I suggest that you could use some combination of rebasing (and/or related operations), depth limits, and replacements (depending on your needs). Depending on how you combine these, you could keep the illusion of a seamless history without changing the SHA ID of any "current" commit. (With more complex techniques, you could even arrange to never change a SHA ID; but this is noticeably harder and will reduce the space savings somewhat.)

So in the following diagrams, there is a root commit identified as 'O'. Subsequent commits (the minutely changes) are identified by a letter and a number. The letter indicates the day the commit was created, the numbers sequentially mark off minutes.

You create your initial commit and place branches on it for each granularity of history you'll eventually use. (As changes accumulate each minute, they'll just go on master.)

O <--(master)(hourly)(weekly)

After a couple days you have

O <-(hourly)(weekly)
 \
  A1 - A2 - A3 - ... - A1439 - A1440 - B1 - B2 - ... - B1439 - B1440 - C1 <--(master)

And maybe you've decided that at midnight, any sub-hour snapshot that's 24 hours old can be discarded.

So as day C starts, the A snapshots are older than 24 hours and should be reduced to hourly snapshots. First we must create the hourly snapshots:

git checkout hourly
git merge --squash A60
git commit -m 'Day A 1-60'
git merge --squash A120
git commit -m 'Day A 61-120'
...

And this gives you

O <-(weekly)
|\
| A60' - A120' - ... - A1380' - A1440' <-(hourly)
 \
  A1 - A2 - A3 - ... - A1439 - A1440 - B1 - B2 - ... - B1439 - B1440 - C1 <--(master)

Here A1440' is a rewrite of A1440, but with a different parentage (such that its direct parent is "an hour ago" instead of "a minute ago").

Next, to make the history seamless, you would have B1 identify A1440' as its parent. If you don't care about changing the SHA ID of every commit (including current ones), a rebase will work:

git rebase --onto A1440' A1440 master

Or in this case (since the TREEs at A1440 and A1440' are the same) it would be equivalent to re-parent B1 - see the git filter-branch docs for details of that approach. Either way you would end up with

O <-(weekly)
|\
| A60' - A120' - ... - A1380' - A1440' <-(hourly)
|                                     \
|                                      B1' - B2' - ... - B1439' - B1440' - C1' <-(master)
 \
  A1 - A2 - A3 - ... - A1439 - A1440 - B1 - B2 - ... - B1439 - B1440 - C1

Note that even though the granularity of changes in the B and C commits is unchanged, these are still "rewritten" commits (hence the ' notation); and in fact the original commits have not yet been physically deleted. They are unreachable, though, so they'll eventually be cleaned up by gc; if it's an issue, you can expedite this by discarding reflogs that are more than 24 hours old and then manually running gc.

Alternatively, if you want to preserve the SHA IDs of the B and C commits, you could use git replace:

git replace A1440 A1440'

This has a number of drawbacks, though. There are a few known quirks with replacements. Also, in this scenario the original commits are not unreachable (even though they aren't shown by default); you would have to shallow the master branch to get rid of them. The simplest way to shallow a branch is to clone the repo, but then you have to jump through extra hoops to propagate the replacement refs. So this is an option if you never want the master ref to "realize" it's moving in an abnormal way, but it is not as simple.
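A sketch of that dance (all names invented for the demo; note the file:// URL, since --depth is ignored for plain local-path clones):

```shell
set -e
git init live && cd live
git config user.email you@example.com
git config user.name you
for i in 1 2 3; do
  echo "$i" > data.json
  git add data.json
  git commit -qm "A$i"
done
# Fake a rewritten A2 (same tree, no parent) and install the replacement
tree=$(git rev-parse 'HEAD~1^{tree}')
new=$(git commit-tree -m 'A2 rewritten' "$tree")
git replace HEAD~1 "$new"
cd ..
# Shallow-clone the live repo, then copy the replace refs by hand,
# because clone/fetch do not propagate refs/replace/* by default
git clone --quiet --depth=2 "file://$PWD/live" archive-view
git -C archive-view fetch --quiet origin 'refs/replace/*:refs/replace/*'
git -C archive-view log --oneline
```

In the clone, the log shows A3 followed by the replacement commit, and history stops there, which is the shallow, replaced view the answer describes.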

Upvotes: 1
