Powerful Weapon

Reputation: 43

Why is the compressed Subversion dump file larger than the original?

We are using SVN 1.7 on Solaris 10. Recently we've introduced compressed, incremental backups.

$ svnadmin dump --quiet --incremental --revision 0:30700 /path/to/repo > /path/to/dump
$ gzip -1 /path/to/dump

The final gzipped dump file is larger (~850MB) than the original dump file (~500MB). I tried gzip -9 as well but that still creates a larger file than the original (~650MB).

Upvotes: 2

Views: 1642

Answers (1)

Gunter Ohrner

Reputation: 111

Unfortunately, you did not describe the structure and contents of your repository.

Possibly you're storing data that is already compressed with an efficient compression algorithm (e.g. 7z / LZMA).

This data appears verbatim in the svnadmin dump stream and cannot be compressed further with gzip, so gzip's own overhead actually increases the file size.

Lossless compression algorithms cannot significantly shrink data that is already compressed or encrypted. If an algorithm were guaranteed to shrink every input, you could apply it iteratively and shrink any data down to a single byte, which is clearly impossible.

Lossless compression works by removing redundancy from the input. After one pass, little redundancy remains, so a second pass with another compression algorithm cannot achieve much.

In fact, depending on the algorithm and its output format, the result may even grow, because the algorithm injects its own headers, block markers, and escape sequences.
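You can see this effect directly (a self-contained sketch, not using your repository): random bytes are statistically similar to already-compressed data, so gzipping them yields a slightly larger file, and gzipping the result grows it again.

```shell
# 1 MiB of random bytes behaves like already-compressed data:
# it contains no redundancy for gzip to remove.
head -c 1048576 /dev/urandom > sample.bin
gzip -c sample.bin    > sample.bin.gz     # slightly larger than sample.bin
gzip -c sample.bin.gz > sample.bin.gz.gz  # larger still
ls -l sample.bin sample.bin.gz sample.bin.gz.gz
```

Each pass adds gzip's header, trailer, and stored-block overhead without saving anything.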

You could try invoking svnadmin with the --deltas option, which outputs only the data that differs in each revision, in effect patches between revisions. Without --deltas it outputs the full text of every changed file.
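Adapted to the commands from the question (same paths and revision range, so treat this as a sketch to verify against your environment):

```shell
svnadmin dump --quiet --deltas --incremental --revision 0:30700 /path/to/repo > /path/to/dump
gzip -1 /path/to/dump
```

Note that delta dumps trade dump size for load time: svnadmin load has to reconstruct each revision from the chain of deltas.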

However, if the files in your repository are already compressed, this will not make much (or any) of a difference, because compressed data also cannot be diffed effectively. (Some modified compressors exist, e.g. patched gzip builds with an --rsyncable option or the gzip-compatible pigz tool, that allow this within certain limits and at the expense of compression ratio.)

You probably tried to achieve this with the --incremental flag you provided, but that flag means something else. It only matters when dumping a range of revisions, and it controls whether the first revision of the range is dumped in full or contains only the files changed in that revision. Since you dump from revision 0, it has no effect anyway.

Upvotes: 1
