Reputation: 97
I faced some problems with decompression in the zstd case. I have hdf5-format files that were compressed in the following way:
import h5py as h5
import hdf5plugin
import numpy as np
import sys
import os
filefrom = sys.argv[1]
h5path = sys.argv[2]
f = h5.File(filefrom, 'r')
data = f[h5path]
shape_data = data.shape[1:]
num = data.shape[0]
initShape = (1,) + shape_data
maxShape = (num,) + shape_data
f_zstd = h5.File(filefrom.split('.')[0] + '_zstd.h5', 'w')
d_zstd = f_zstd.create_dataset(h5path, initShape, maxshape=maxShape,
                               dtype=np.int32, chunks=initShape,
                               **hdf5plugin.Zstd())
d_zstd[0,] = data[0,]
for i in range(num):
    d_zstd.resize((i+1,) + shape_data)
    d_zstd[i,] = data[i,]
f_zstd.close()
f.close()
So it compressed without any errors, but when I try to look into the data with h5ls or h5dump, they report that the data can't be printed, and no other way of looking inside the file works either: reading this compressed data with h5py in python3 (3.6) is also unsuccessful. I also tried h5repack (h5repack -i compressed_file.h5 -o out_file.h5 --filter=var:NONE) and the following piece of code:
import zstandard
import pathlib
import os

def decompress_zstandard_to_folder(input_file):
    input_file = pathlib.Path(input_file)
    destination_dir = os.path.dirname(input_file)
    with open(input_file, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        output_path = pathlib.Path(destination_dir) / input_file.stem
        with open(output_path, 'wb') as destination:
            decomp.copy_stream(compressed, destination)
Nothing succeeded. With h5repack no warnings or errors appeared; with the last piece of code I got zstd.ZstdError: zstd decompressor error: Unknown frame descriptor, which as I understand means the compressed data doesn't have the appropriate headers.
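I suppose that is because the zstd filter compresses individual chunks inside the HDF5 container, so the file itself begins with the HDF5 format signature rather than a zstd frame header. A quick stdlib check (magic numbers taken from the HDF5 and zstd format specifications) illustrates this:

```python
# HDF5 files begin with an 8-byte format signature; raw zstd frames begin
# with the 4-byte magic number 0x28 0xB5 0x2F 0xFD.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def looks_like_zstd_frame(header: bytes) -> bool:
    # zstandard's decompressor only accepts streams that start with a valid
    # frame descriptor, hence the "Unknown frame descriptor" error otherwise.
    return header.startswith(ZSTD_MAGIC)

print(looks_like_zstd_frame(HDF5_SIGNATURE))  # False
```

So decompressing the whole .h5 file as one zstd stream can't work; only the individual chunk payloads inside it are zstd frames.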
I use python 3.6.7 and hdf5 1.10.5. So I'm a bit confused and don't have any idea how to overcome this issue.
Will be happy for any ideas/advice!
Upvotes: 1
Views: 1509
Reputation: 8006
I wrote a simple test to validate zstd compression behavior with a small dataset (a NumPy array of int32). I can open the HDF5 file with h5py and read the dataset. (Note: I could not open it with HDFView, and h5repack only reports the shape and type attributes, not the data.)
I suspect an undetected error in another part of your code. Have you tested your code logic without zstd compression? If not, I suggest you start there.
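For example, here is a minimal sketch of the same resize-and-append loop with no compression filter, to isolate the dataset-growing logic (the in-memory array and the dataset name 'data' are stand-ins for your inputs):

```python
import h5py as h5
import numpy as np

# Stand-in for the source dataset in the question.
data = np.arange(1_000, dtype=np.int32).reshape(100, 10)
shape_data = data.shape[1:]
num = data.shape[0]

# Same grow-by-one pattern as in the question, minus the zstd filter.
with h5.File('test_nofilter.h5', 'w') as f:
    d = f.create_dataset('data', (1,) + shape_data,
                         maxshape=(num,) + shape_data,
                         dtype=np.int32, chunks=(1,) + shape_data)
    for i in range(num):
        d.resize((i + 1,) + shape_data)
        d[i] = data[i]

# Read it back to confirm the copy loop worked.
with h5.File('test_nofilter.h5', 'r') as f:
    print(f['data'].shape)                 # (100, 10)
    print((f['data'][...] == data).all())  # True
```

If this pattern works uncompressed but fails once you add **hdf5plugin.Zstd(), the problem is with the filter setup rather than the copy logic.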
Code to Write example file:
import h5py as h5
import hdf5plugin
import numpy as np
data = np.arange(1_000, dtype=np.int32).reshape(100,10)
with h5.File('test_zstd.h5','w') as f_zstd:
    d_zstd = f_zstd.create_dataset('zstd_data', data=data, **hdf5plugin.Zstd())
Code to Read example file:
import h5py as h5
import hdf5plugin ## Note: plugin required to read
with h5.File('test_zstd.h5','r') as f_zstd:
    d_zstd = f_zstd['zstd_data']
    print(d_zstd.shape, d_zstd.dtype)
    print(d_zstd[0,:])
    print(d_zstd[-1,:])
Output from above:
(100, 10) int32
[0 1 2 3 4 5 6 7 8 9]
[990 991 992 993 994 995 996 997 998 999]
More on HDF5 and compression:
To use HDF5 utilities (like h5repack) to read a compressed file, the HDF5 installation needs the appropriate compression filter. Some are standard; many (including Zstandard) require you to install a third-party filter. Links to available plugins are here: HDF5 Registered Filter Plugins
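If the utilities can't find a filter, the usual mechanism is the HDF5_PLUGIN_PATH environment variable, which names the directory HDF5 searches for dynamically loaded filter plugins. A sketch (the plugin directory below is a placeholder for wherever the zstd plugin library actually lives on your system):

```python
import os

# HDF5 tools and libraries look for dynamically loaded filter plugins in the
# directory named by HDF5_PLUGIN_PATH. '/path/to/hdf5/plugins' is a
# placeholder; point it at the directory holding the zstd plugin (.so/.dll).
os.environ['HDF5_PLUGIN_PATH'] = '/path/to/hdf5/plugins'

# Note: this must be set before the HDF5 library is loaded (i.e., before
# importing h5py), or exported in the shell before running h5dump/h5repack.
```

(When using h5py from Python, simply importing hdf5plugin registers its bundled filters, so the environment variable is only needed for the command-line tools or other HDF5 installations.)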
You can verify the compression filter with h5dump by adding the -pH flag, like this:
E:\SO_68526704>h5dump -pH test_zstd.h5
HDF5 "test_zstd.h5" {
GROUP "/" {
   DATASET "zstd_data" {
      DATATYPE H5T_STD_I32LE
      DATASPACE SIMPLE { ( 100, 10 ) / ( 100, 10 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 100, 10 )
         SIZE 1905 (2.100:1 COMPRESSION)
      }
      FILTERS {
         USER_DEFINED_FILTER {
            FILTER_ID 32015
            COMMENT Zstandard compression: http://www.zstd.net
         }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_ALLOC
         VALUE H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
   }
}
}
Upvotes: 1