lenz

Reputation: 5817

Compressing large, near-identical files

I have a bunch of large HDF5 files (all around 1.7 GB) that share a lot of their content; I'd guess that more than 95% of the data in each file is repeated in every other one.

I would like to compress them into an archive. My first attempt, GNU tar with the -z option (gzip), failed: the process was terminated when the archive reached 50 GB (probably a file-size limit imposed by the sysadmin). Apparently gzip couldn't take advantage of the files being near-identical in this setting, which makes sense: DEFLATE's sliding window is only 32 KB, far too small to spot repetitions between files that are gigabytes apart.

Compressing these particular files obviously doesn't call for a very fancy compression algorithm, just a very patient one. Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid storing them more than once in the archive?

Upvotes: 1

Views: 207

Answers (1)

Mark Adler

Reputation: 112284

Sounds like what you need is a binary diff program. You can google for that, then try running a binary diff between two of the files and compressing one of them plus the resulting diff. You could get fancy and diff all combinations, pick the smallest diffs to compress, and send only one original file.
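As a rough sketch of the single-reference variant, using xdelta3 as one example of a binary diff tool (the file names here are made up for illustration): keep one file as the reference, store only VCDIFF deltas for the rest, and compress the reference itself.

    # Sketch only: assumes xdelta3 is installed; file names are hypothetical.
    ref=file1.h5                                # pick one file as the reference

    for f in file2.h5 file3.h5 file4.h5; do
        # encode each file as a VCDIFF delta against the reference
        xdelta3 -e -s "$ref" "$f" "$f.vcdiff"
    done

    gzip -9 "$ref"                              # compress the reference itself

    # Later, to rebuild e.g. file2.h5 from the reference and its delta:
    #   gunzip file1.h5.gz
    #   xdelta3 -d -s file1.h5 file2.h5.vcdiff file2.h5

Using one fixed reference avoids diffing all pairs, at the cost of possibly larger deltas if some files resemble each other more than they resemble the reference; the all-combinations approach described above trades extra diffing work for smaller output.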

Upvotes: 2
