Shubhjot

Reputation: 93

Why does zipping an HDF5 file still yield a good amount of compression even though all datasets inside the file are compressed?

I am using the HDF5 file format in my desktop application. I have applied GZIP level 5 compression to all the datasets inside the file.

But when I zip the HDF5 file using 7-Zip, the file still shrinks to around half or even a third of its size!

The process I am following is:

  1. Generating the HDF5 file.
  2. Importing data into the file.
  3. Freeing up unaccounted space, if any, using the h5repack utility.
  4. Zipping the file to .zip using 7-Zip.

How is this possible?

Where is the room for more compression?

How can I generate an even smaller HDF5 file in the first place? Any suggestions about using property lists (H5P)?

I thought 7-Zip might just be compressing my file more aggressively (say, GZIP level 9), so I tried GZIP level 9 inside the HDF5 file as well, but the zipped file is still about half the size of the new file.
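
For context, this is roughly how I apply the compression: per dataset, through a dataset creation property list (H5P). A minimal sketch using the h5py Python bindings (file and dataset names are placeholders):

import h5py
import numpy as np

# Placeholder names; the compression filter is a per-dataset setting
# carried in the dataset creation property list (H5P_DATASET_CREATE).
with h5py.File("data.h5", "w") as f:
    f.create_dataset(
        "measurements",
        data=np.random.rand(1000, 1000),
        chunks=True,           # the DEFLATE filter requires chunked storage
        compression="gzip",    # GZIP/DEFLATE filter
        compression_opts=5,    # level 5, as described above
    )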

Upvotes: 2

Views: 2775

Answers (2)

Quincey Koziol

Reputation: 140

You are applying compression only to the dataset elements in the HDF5 file. Other components of the file (internal metadata and objects such as groups) are not compressed. So when you compress the entire file, those other components compress, and the already-compressed dataset elements may compress somewhat further as well.
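
You can see this by building a file that is mostly metadata and then compressing the whole file. A rough sketch using the h5py Python bindings (file, group, and attribute names are made up):

import gzip
import os

import h5py
import numpy as np

# Made-up names: a file dominated by uncompressed internal metadata
# (many groups and attributes) plus one gzip-compressed dataset.
with h5py.File("meta_heavy.h5", "w") as f:
    for i in range(10000):
        grp = f.create_group("group_%05d" % i)
        grp.attrs["description"] = "some repetitive attribute text"
    f.create_dataset("data", data=np.zeros(100000),
                     compression="gzip", compression_opts=5)

raw = os.path.getsize("meta_heavy.h5")
with open("meta_heavy.h5", "rb") as src:
    packed = len(gzip.compress(src.read()))
# The group metadata and attributes compress well, so the whole file
# shrinks even though the dataset elements were already compressed.
print("%d -> %d bytes (%.1fx)" % (raw, packed, raw / packed))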

Upvotes: 3

Mark Adler

Reputation: 112374

gzip has a maximum compression ratio of about 1000:1. If the data is more compressible than that, then you can compress it a second time to get more compression (the second time could be gzip again). You can do a simple experiment with a file consisting of only zeros:

% dd ibs=1 count=1000000 < /dev/zero > zeros
% wc -c zeros
1000000
% gzip < zeros | wc -c
1003
% gzip < zeros | gzip | wc -c
64

So what was the compression ratio of your first compression?
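
One way to measure that per dataset is to compare each dataset's logical size with the bytes it actually occupies on disk. A sketch using the h5py Python bindings ("your_file.h5" is a placeholder for your file):

import h5py

# "your_file.h5" is a placeholder for the file from the question.
with h5py.File("your_file.h5", "r") as f:
    def report(name, obj):
        if isinstance(obj, h5py.Dataset):
            logical = obj.size * obj.dtype.itemsize  # uncompressed bytes
            stored = obj.id.get_storage_size()       # bytes on disk
            if stored:
                print("%s: %d -> %d bytes (%.0f:1)"
                      % (name, logical, stored, logical / stored))
    f.visititems(report)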

Upvotes: 3
