Grubsnik

Reputation: 929

SVN repo vastly bigger than the dumpfile?

I've been put in charge of migrating our SVN installation from version 1.5.6 to 1.7.6. As part of that I did a dump/load cycle of both our repositories and happened to notice something odd.

One of the repos "dumps" to a 2GB file, but after loading it, it takes up nearly 23GB of disk space. This was also an issue in 1.5.6, but we were hoping the upgrade might help with that.

The repo in question is a little "odd" in that it contains a single folder with 7500 files (used to be up to 12000) and a subfolder with another 500 or so files, and that is it.

It appears that this may be related to the following issue: "350GB SVN repo creates atleast 1MB revision for even a simplest task like branch/tag"

I am very much at a loss as to what we can do about this right now, but the repo is presently growing at a ridiculous pace and we will need to relocate it if we don't get it solved. A task I was hoping to avoid.

Upvotes: 4

Views: 1328

Answers (1)

Andrew Alcock

Reputation: 19651

First, SVN has two different repository backends: BDB (Berkeley DB) and FSFS (file system). How the repository is laid out on disk depends on this choice, with BDB typically being a bit larger. Which do you use?
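If you are not sure which backend a repository uses, the repository records it in a small text file, db/fs-type. The following is a minimal sketch that reads it; the function name repository_backend is made up for this example, and the path you pass in is whatever your repository root is.

```python
from pathlib import Path


def repository_backend(repo_path):
    """Return the backend name stored in <repo>/db/fs-type (e.g. 'fsfs' or 'bdb').

    repository_backend is a hypothetical helper name, not part of Subversion.
    """
    fs_type_file = Path(repo_path) / "db" / "fs-type"
    # The file holds a single word followed by a newline.
    return fs_type_file.read_text().strip()
```

Calling repository_backend("/path/to/repository") should return the backend name as a string; simply opening db/fs-type in a text editor tells you the same thing.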

If you use FSFS, then you should read up on sharding and packing. Every commit, however small, is written to its own revision file, and each file occupies at least one allocation unit on disk, normally 2 KB to 16 KB. Multiply that by the number of revisions in the repository and you can get very big numbers. The good news is that you can run a command to condense each completed shard into a single file:

svnadmin pack /path/to/repository

This might greatly improve your on-disk size.
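To get a feel for how much space is lost to per-file rounding before you pack, you can compare the sum of the file sizes with the sum of those sizes rounded up to whole blocks. This is only a rough sketch: the 4096-byte block size is an assumption (check your own filesystem), the helper names are invented for this example, and some filesystems store tiny files more cleverly than this model assumes.

```python
import os


def allocated_size(size, block_size=4096):
    """Round a file size up to a whole number of blocks (ceiling division)."""
    return -(-size // block_size) * block_size


def slack_report(directory, block_size=4096):
    """Walk a directory tree and return (file_count, apparent_bytes, allocated_bytes).

    block_size is an assumed allocation unit, not a value read from the disk.
    """
    count = 0
    apparent = 0
    allocated = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            size = os.path.getsize(os.path.join(root, name))
            count += 1
            apparent += size
            allocated += allocated_size(size, block_size)
    return count, apparent, allocated
```

Running slack_report on the db directory of the repository before and after svnadmin pack should show whether the gap between the two byte totals shrinks. If the gap is small to begin with, block rounding is probably not your main problem.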

Alternatively, the space problem might be the one you mention: a very large number of files in a single directory.
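One way to tell the two explanations apart is to look at how big each revision file actually is. If nearly every revision weighs in at around a megabyte, as in the question you linked, then packing alone will not save much. The sketch below assumes an FSFS repository whose revision files live under db/revs and are named by revision number; files inside already-packed shards are not named that way and are skipped. The function names are invented for this example.

```python
import os


def revision_sizes(repo_path):
    """Map revision number -> size in bytes for unpacked FSFS revision files.

    Assumes revision files under <repo>/db/revs are named with plain digits.
    """
    revs_dir = os.path.join(repo_path, "db", "revs")
    sizes = {}
    for root, _dirs, files in os.walk(revs_dir):
        for name in files:
            if name.isdigit():  # ignores non-numeric names such as pack files
                sizes[int(name)] = os.path.getsize(os.path.join(root, name))
    return sizes


def summarize(sizes, top=5):
    """Return the count, total, mean size and the largest revisions, or None if empty."""
    if not sizes:
        return None
    total = sum(sizes.values())
    largest = sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]
    return {
        "revisions": len(sizes),
        "total_bytes": total,
        "mean_bytes": total / len(sizes),
        "largest": largest,
    }
```

Something like summarize(revision_sizes("/path/to/repository")) gives a quick picture: a mean of a few kilobytes points towards per-file overhead, while a mean close to a megabyte points towards the large-directory issue.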

In any case, you ask why the dump file is so much smaller than the repository. The dump file is a single file in a format that is essentially a record of every commit ever made to the repository, which is a very terse form of the repository (especially if --deltas is used). Since everything is placed into a single file, the per-file overhead of sharding is avoided.

I used to use and champion SVN in a previous organisation. I recently moved to the Mercurial DVCS (also called Hg, and similar to Git). Once you have made the switch, it is difficult to think of ever going back. Anyway, here is a quote from Softpedia about repository size:

Disk space: When the Mozilla project was ported from SVN to Mercurial (very similar to Git in performance), disk space usage went down from 12GB to 420MB, 30 times smaller than the original size. Git is supposed to use the same storage algorithms, so file size should be around the same value.

You might want to investigate what would happen in your case if you switched to Hg or Git. If it is as dramatic as Softpedia's example, you could recommend Hg/Git to your management.

Upvotes: 1
