Reputation: 8557
I'm reading the book Python and HDF5 (O'Reilly) which has a section on empty datasets and the size they take on disk:
import numpy as np
import h5py
f = h5py.File("testfile.hdf5")
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)
f.flush()
# Size on disk is 1KB
dset[0:1024] = np.arange(1024)
f.flush()
# Size on disk is 4GB
After filling part of the dataset (the first 1024 entries) with values, I expected the file to grow, but not to 4 GB. It's essentially the same size as when I do:
dset[...] = np.arange(1024**3)
The book states that the file size on disk should be around 66 KB. Could anyone explain the reason for the sudden size increase?
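The sizes above are just the size of testfile.hdf5 on disk, e.g. as reported by:

import os

# Report the on-disk size of the file from the example above.
print(os.path.getsize("testfile.hdf5") / 1024**2, "MiB")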
Version info:
Upvotes: 0
Views: 1193
Reputation: 5546
If you open your file in HDFView you can see that chunking is off. This means the array is stored as one contiguous block in the file and cannot be resized, so the full 4 GB must be allocated in the file.
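You can also check this from h5py itself: a dataset's chunks property is None for contiguous storage and a tuple of chunk dimensions when chunking is enabled. A minimal sketch, assuming the testfile.hdf5 from the question:

import h5py

# dset.chunks is None for a contiguous layout, and a tuple (the chunk
# shape) when the dataset is chunked.
with h5py.File("testfile.hdf5", "r") as f:
    dset = f["big dataset"]
    print(dset.chunks)  # None -> contiguous, so the full 4 GB is allocated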
If you create your dataset with chunking enabled, it is divided into regularly-sized pieces that are stored at arbitrary locations in the file and indexed using a B-tree. In that case only the chunks that contain data (at least one written element) are allocated on disk. If you create your dataset as follows, the file will be much smaller:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=True)
Passing chunks=True lets h5py choose the chunk size automatically. You can also set the chunk size explicitly; for example, to set it to 16384 floats (= 64 KiB), use:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=(2**14,))
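As a rough illustration (the file names contiguous.hdf5 and chunked.hdf5 are just examples, and running this really does allocate about 4 GB for the contiguous file), you can compare the resulting sizes on disk after writing the same 1024 values:

import os
import numpy as np
import h5py

shape = (1024**3,)

# Contiguous layout: writing any data forces the full 4 GB allocation.
with h5py.File("contiguous.hdf5", "w") as f:
    d = f.create_dataset("big dataset", shape, dtype=np.float32)
    d[0:1024] = np.arange(1024)

# Chunked layout: only the chunks that actually receive data are allocated.
with h5py.File("chunked.hdf5", "w") as f:
    d = f.create_dataset("big dataset", shape, dtype=np.float32, chunks=(2**14,))
    d[0:1024] = np.arange(1024)

for name in ("contiguous.hdf5", "chunked.hdf5"):
    print(name, os.path.getsize(name), "bytes")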
The best chunk size depends on the reading and writing patterns of your applications. Note that:
Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
See http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage
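Since the whole chunk is read whenever any element in it is accessed, it helps to align your reads with chunk boundaries where you can. A small sketch, reusing the chunked.hdf5 file and the 16384-element chunks from the example above:

import h5py

chunk = 2**14  # 16384 float32 values = 64 KiB per chunk

with h5py.File("chunked.hdf5", "r") as f:
    dset = f["big dataset"]
    # This slice maps to exactly one chunk on disk ...
    aligned = dset[0:chunk]
    # ... while this one straddles a chunk boundary and reads two chunks.
    straddling = dset[chunk - 8:chunk + 8]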
Upvotes: 3