Danny Han

Reputation: 187

Accessing an HDF5 dataset's shape is much slower (45 times) when the HDF5 file is larger

I am seeing a significant reduction in read speed when recursively accessing data in a very large HDF5 file that contains many datasets.

There are two HDF5 files, "small.hdf5" and "large.hdf5".

When I run the code below (along with a profiler), I find that the for loop over small.hdf5 runs at around 9000 it/s, while for large.hdf5 it runs at around 200 it/s, roughly a 45x reduction in speed.

As the results below show, __getitem__ takes MUCH longer as the number of groups in the HDF5 file increases (i.e. during self.__file[subject]['eeg'].shape[1]).

Is my assessment correct? Is there a way to resolve this?

I don't know much about programming, but could this be caused by the increased number of groups h5py needs to search through when I query the HDF5 file? Should I instead loop over the datasets themselves rather than querying the file by name on each iteration?

Thank you in advance for any suggestions/comments!

import h5py
from tqdm import tqdm
import cProfile
import pstats
import io

self.__file = h5py.File(str(self.__file_path), 'r')
self.__subjects = [i for i in self.__file]

profiler = cProfile.Profile()
profiler.enable()
ssum = 0
for subject in tqdm(self.__subjects, desc="Processing subjects (SingleShockDataset)"):
    subject_len = self.__file[subject]['eeg'].shape[1]

    # break early, as the full loop takes too long
    ssum += 1
    if ssum > 10000:
        break
profiler.disable()
s = io.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(profiler, stream=s).sort_stats(sortby)
ps.print_stats(10)
print(s.getvalue())
        

Results for small.hdf5:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      2/1    0.000    0.000    1.292    1.292 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:637(wait)
      2/1    0.000    0.000    1.292    1.292 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:323(wait)
      9/3    0.168    0.019    1.292    0.431 {method 'acquire' of '_thread.lock' objects}
    20002    0.598    0.000    1.047    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:348(__getitem__)
    10001    0.185    0.000    0.208    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:659(__init__)
    10001    0.024    0.000    0.129    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:278(file)
    10001    0.052    0.000    0.091    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/files.py:376(__init__)
    10001    0.066    0.000    0.067    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:485(shape)
    60007    0.029    0.000    0.044    0.000 <frozen importlib._bootstrap>:1390(_handle_fromlist)
    60007    0.021    0.000    0.034    0.000 <frozen importlib._bootstrap>:645(parent)
    
    

And for large.hdf5:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7    0.101    0.014   82.463   11.780 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:637(wait)
    20002   65.130    0.003   66.293    0.003 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:348(__getitem__)
        7    0.013    0.002   62.463    8.923 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:323(wait)
       28    0.214    0.008   42.437    1.516 {method 'acquire' of '_thread.lock' objects}
    10001    0.596    0.000    0.657    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:659(__init__)
    10001    0.059    0.000    0.238    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:278(file)
    10002    0.028    0.000    0.198    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:1160(__iter__)
      659    0.005    0.000    0.168    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:1198(update)
    10001    0.164    0.000    0.165    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:485(shape)
      660    0.003    0.000    0.159    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:1325(refresh)

We can clearly see that __getitem__ is causing the issue.
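
To double-check that it is the lookup itself (and not tqdm or the profiler) that is slow, a minimal timing sketch like the following could be used; this was not part of my original run, and the file name is just the one from above:

import time
import h5py

# Time only the group/dataset lookups, without tqdm or cProfile in the way.
with h5py.File('large.hdf5', 'r') as f:
    names = list(f)[:1000]                 # first 1,000 group names
    t0 = time.perf_counter()
    for name in names:
        _ = f[name]['eeg'].shape[1]        # same lookup pattern as above
    print(f'{len(names)} lookups took {time.perf_counter() - t0:.2f} s')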

Below is the h5py configuration I ran this with:

Summary of the h5py configuration
---------------------------------

h5py    3.12.1
HDF5    1.14.3
Python  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.26.4
cython (built with) 3.0.11
numpy (built against) 2.0.2
HDF5 (built against) 1.14.3
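
(For reference, a summary like the above can be printed with h5py's bundled version info, presumably something like:)

import h5py
print(h5py.version.info)   # prints the h5py / HDF5 / Python / numpy build summary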

UPDATE: a for loop directly over the groups does not help

for group_name, group in tqdm(self.__file.items(), desc="Processing groups (SingleShockDataset)"):
    subject_len = group['eeg'].shape[1]

Running the above code (so that each group doesn't have to be looked up by name on every iteration) is still slow:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   20002   62.178    0.003   63.334    0.003 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:348(__getitem__)
   10002    0.029    0.000   62.840    0.006 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:1160(__iter__)
   10002    0.021    0.000   62.657    0.006 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:431(__iter__)
       7    0.000    0.000   61.521    8.789 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:637(wait)
       7    0.006    0.001   61.521    8.789 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:323(wait)
      28    0.379    0.014   61.515    2.197 {method 'acquire' of '_thread.lock' objects}
   10001    0.019    0.000   61.434    0.006 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:372(get)
   10002    1.182    0.000    1.196    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:496(__iter__)
   10001    0.601    0.000    0.662    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:659(__init__)
       1    0.000    0.000    0.421    0.421 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:952(__init__)

UPDATE 2: I tried to access the datasets in batches as below, but the speed essentially didn't change.

    def _collect_eeg_datasets(self):
        eeg_datasets = []

        def visitor(name, obj):
            if isinstance(obj, h5py.Dataset) and name.endswith('eeg'):
                eeg_datasets.append(obj)

        self.__file.visititems(visitor)
        return eeg_datasets

    def ....:

        # profiler and ssum are set up as in the first snippet
        dataset_names = list(self.__file.keys())
        num_datasets = len(dataset_names)
        batch_size = 50
        for i in tqdm(range(0, num_datasets, batch_size), desc="Processing EEG datasets in Batches"):
            batch_names = dataset_names[i:i + batch_size]
            # Access datasets without loading data into RAM
            eeg_datasets = [self.__file[name]['eeg'] for name in batch_names]
            for eeg_dataset in eeg_datasets:
                subject_len = eeg_dataset.shape[1]
                ssum += 1
            if ssum > 10000:
                break
        profiler.disable()
        s = io.StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(profiler, stream=s).sort_stats(sortby)
        ps.print_stats(10)
        print(s.getvalue())
   Ordered by: cumulative time
   List reduced from 135 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20100   40.024    0.002   41.035    0.002 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:348(__getitem__)
       16    0.380    0.024   39.628    2.477 {method 'acquire' of '_thread.lock' objects}
  1000417    0.121    0.000    1.777    0.000 <frozen _collections_abc>:868(__iter__)
  1000417    1.354    0.000    1.656    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:496(__iter__)
    10050    0.544    0.000    0.594    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:659(__init__)
      408    0.000    0.000    0.422    0.001 {built-in method builtins.len}
        1    0.000    0.000    0.422    0.422 <frozen _collections_abc>:848(__len__)
        1    0.422    0.422    0.422    0.422 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:491(__len__)
  1000416    0.156    0.000    0.302    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:208(_d)
    10050    0.053    0.000    0.225    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:278(file)

Upvotes: 0

Views: 53

Answers (1)

kcw78

Reputation: 8006

Your scenario caught my eye. I have not worked with H5 files the size of large.hdf5. However, I've worked with some close to 1 TB (but they did not have 1M groups/datasets). To compare performance, I created 2 simple programs: 1) to create 2 H5 files, and 2) to read the files. The small file has 100K groups and the large one has 1M groups. I did NOT see performance degradation when reading the larger file. (However, my datasets were tiny.)

When I run the program to read the files, it takes approx 0.28 sec to access the dataset shape for each 1,000 groups/datasets, for both small.h5 and large.h5. This time increment is consistent for every group/dataset in each file. One note: it does take longer to read the first 1,000 groups/datasets in large.h5 -- but it was only 4.5 sec. Incremental times dropped to match small.h5 after that. (I used a simple time function to capture timing data - it's sufficient for this test.)

So, clearly there is more to consider than just the number of groups/datasets. Possibilities include:

  • the size of your datasets
  • your dataset attributes (compression, chunking, etc.) -- see the inspection sketch after this list
  • your data structures (e.g. large list manipulation)
  • other overhead in your program
  • interaction with other modules (e.g. tqdm)
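
For the dataset-related items, the relevant properties can be inspected directly on one of the 'eeg' datasets. A minimal sketch (the file and dataset names follow the question; everything else is illustrative):

import h5py

with h5py.File('large.hdf5', 'r') as h5f:
    first = next(iter(h5f))                  # name of the first group
    ds = h5f[first]['eeg']
    print('shape      :', ds.shape)
    print('dtype      :', ds.dtype)
    print('chunks     :', ds.chunks)         # None means contiguous storage
    print('compression:', ds.compression)    # e.g. 'gzip', or None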

I have enclosed my simple programs. Feel free to run and observe their behavior. Then modify to see if you can replicate your behavior. Hopefully it helps you pinpoint the bottleneck.

To create the files:

import numpy as np
import h5py

arr = np.arange(100).astype('float').reshape(20,5)

small_no = 100_000
large_no = 1_000_000

with h5py.File('small.h5','w') as h5f:
    for g_cnt in range(1,small_no+1):
        grp = h5f.create_group(f'group_{g_cnt:06}')
        ds = grp.create_dataset('eeg',data=arr)

with h5py.File('large.h5','w') as h5f:
    for g_cnt in range(1,large_no+1):
        grp = h5f.create_group(f'group_{g_cnt:07}')
        ds = grp.create_dataset('eeg',data=arr)
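
To move the test files closer to the question's scenario, the creation step could be modified to write larger, chunked and/or compressed datasets. A sketch with arbitrary sizes and settings (not the OP's actual data):

import numpy as np
import h5py

# Hypothetical variation: larger, chunked, gzip-compressed datasets.
big_arr = np.random.rand(20, 50_000)

with h5py.File('large_chunked.h5','w') as h5f:
    for g_cnt in range(1, 10_000+1):    # fewer groups to keep the file size manageable
        grp = h5f.create_group(f'group_{g_cnt:07}')
        grp.create_dataset('eeg', data=big_arr, chunks=True, compression='gzip')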

To read the files (change the file name to large.h5 to read the large file):

import time
import h5py

t = time.time()
with h5py.File('small.h5') as h5f:
    for g_cnt, grp in enumerate(h5f):
        ds_size = h5f[grp]['eeg'].shape[1]
        if not ((g_cnt+1) % 1000):
            print(f'Incremental elapsed time at {(g_cnt+1):,}: {(time.time()-t):.2f}')
            t = time.time()

Upvotes: 1
