Danny Han

Reputation: 187

Accessing an HDF5 dataset's shape is much slower (45 times) when the HDF5 file is larger

I am seeing a significant reduction in read speed when recursively accessing data in a very large HDF5 file that contains many datasets.

There are two HDF5 files, "small.hdf5" and "large.hdf5".

When I run the code below (along with a profiler), I find that the for loop over small.hdf5 runs at around 9000 it/s, while for large.hdf5 it runs at around 200 it/s, roughly a 45x reduction in speed.

As the results below show, __getitem__ takes MUCH longer as the number of groups in the HDF5 file increases (i.e. during self.__file[subject]['eeg'].shape[1]).

Is my assessment correct? Is there a way to resolve this?

I don't know much about programming, but could this be caused by the increased number of groups h5py needs to search through when I query the HDF5 file? Should I instead loop over the datasets themselves rather than querying the file by name on each iteration?

Thank you in advance for any suggestions/comments!

import h5py
from tqdm import tqdm
import cProfile
import pstats
import io

self.__file = h5py.File(str(self.__file_path), 'r')
self.__subjects = [i for i in self.__file]

profiler = cProfile.Profile()
profiler.enable()
ssum = 0
for subject in tqdm(self.__subjects, desc="Processing subjects (SingleShockDataset)"):
    subject_len = self.__file[subject]['eeg'].shape[1]

    # break early, as the full loop takes too long
    ssum += 1
    if ssum > 10000:
        break
profiler.disable()
s = io.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(profiler, stream=s).sort_stats(sortby)
ps.print_stats(10)
print(s.getvalue())
        

Results for small.hdf5:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      2/1    0.000    0.000    1.292    1.292 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:637(wait)
      2/1    0.000    0.000    1.292    1.292 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:323(wait)
      9/3    0.168    0.019    1.292    0.431 {method 'acquire' of '_thread.lock' objects}
    20002    0.598    0.000    1.047    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:348(__getitem__)
    10001    0.185    0.000    0.208    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:659(__init__)
    10001    0.024    0.000    0.129    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:278(file)
    10001    0.052    0.000    0.091    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/files.py:376(__init__)
    10001    0.066    0.000    0.067    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:485(shape)
    60007    0.029    0.000    0.044    0.000 <frozen importlib._bootstrap>:1390(_handle_fromlist)
    60007    0.021    0.000    0.034    0.000 <frozen importlib._bootstrap>:645(parent)
    
    

And for large.hdf5:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7    0.101    0.014   82.463   11.780 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:637(wait)
    20002   65.130    0.003   66.293    0.003 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:348(__getitem__)
        7    0.013    0.002   62.463    8.923 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:323(wait)
       28    0.214    0.008   42.437    1.516 {method 'acquire' of '_thread.lock' objects}
    10001    0.596    0.000    0.657    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:659(__init__)
    10001    0.059    0.000    0.238    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:278(file)
    10002    0.028    0.000    0.198    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:1160(__iter__)
      659    0.005    0.000    0.168    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:1198(update)
    10001    0.164    0.000    0.165    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:485(shape)
      660    0.003    0.000    0.159    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:1325(refresh)

We can clearly see that __getitem__ is causing the issue.
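
To double-check that it is the lookup itself (and not tqdm or the profiler) that is slow, a minimal timing sketch like the following could be used; this was not part of my original run, and the file name is just the one from above:

import time
import h5py

# Time only the group/dataset lookups, without tqdm or cProfile in the way.
with h5py.File('large.hdf5', 'r') as f:
    names = list(f)[:1000]                 # first 1,000 group names
    t0 = time.perf_counter()
    for name in names:
        _ = f[name]['eeg'].shape[1]        # same lookup pattern as above
    print(f'{len(names)} lookups took {time.perf_counter() - t0:.2f} s')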

Below is the h5py configuration I ran this with:

Summary of the h5py configuration
---------------------------------

h5py    3.12.1
HDF5    1.14.3
Python  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.26.4
cython (built with) 3.0.11
numpy (built against) 2.0.2
HDF5 (built against) 1.14.3
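
(For reference, a summary like the above can be printed with h5py's bundled version info, presumably something like:)

import h5py
print(h5py.version.info)   # prints the h5py / HDF5 / Python / numpy build summary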

UPDATE: a for loop directly over the groups does not help

for group_name, group in tqdm(self.__file.items(), desc="Processing groups (SingleShockDataset)"):
    subject_len = group['eeg'].shape[1]

Running the above code (so that each group doesn't have to be looked up by name on every iteration) is still slow:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   20002   62.178    0.003   63.334    0.003 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:348(__getitem__)
   10002    0.029    0.000   62.840    0.006 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:1160(__iter__)
   10002    0.021    0.000   62.657    0.006 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:431(__iter__)
       7    0.000    0.000   61.521    8.789 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:637(wait)
       7    0.006    0.001   61.521    8.789 /global/common/software/m4244/DIVER/lib/python3.12/threading.py:323(wait)
      28    0.379    0.014   61.515    2.197 {method 'acquire' of '_thread.lock' objects}
   10001    0.019    0.000   61.434    0.006 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:372(get)
   10002    1.182    0.000    1.196    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:496(__iter__)
   10001    0.601    0.000    0.662    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:659(__init__)
       1    0.000    0.000    0.421    0.421 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/tqdm/std.py:952(__init__)

UPDATE 2: I tried to access the datasets in batches as below, but the speed essentially didn't change.

    def _collect_eeg_datasets(self):
        eeg_datasets = []

        def visitor(name, obj):
            if isinstance(obj, h5py.Dataset) and name.endswith('eeg'):
                eeg_datasets.append(obj)

        self.__file.visititems(visitor)
        return eeg_datasets

    def ....:

        # profiler and ssum are set up as in the first snippet
        dataset_names = list(self.__file.keys())
        num_datasets = len(dataset_names)
        batch_size = 50
        for i in tqdm(range(0, num_datasets, batch_size), desc="Processing EEG datasets in Batches"):
            batch_names = dataset_names[i:i + batch_size]
            # Access datasets without loading data into RAM
            eeg_datasets = [self.__file[name]['eeg'] for name in batch_names]
            for eeg_dataset in eeg_datasets:
                subject_len = eeg_dataset.shape[1]
                ssum += 1
            if ssum > 10000:
                break
        profiler.disable()
        s = io.StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(profiler, stream=s).sort_stats(sortby)
        ps.print_stats(10)
        print(s.getvalue())
   Ordered by: cumulative time
   List reduced from 135 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20100   40.024    0.002   41.035    0.002 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:348(__getitem__)
       16    0.380    0.024   39.628    2.477 {method 'acquire' of '_thread.lock' objects}
  1000417    0.121    0.000    1.777    0.000 <frozen _collections_abc>:868(__iter__)
  1000417    1.354    0.000    1.656    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:496(__iter__)
    10050    0.544    0.000    0.594    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/dataset.py:659(__init__)
      408    0.000    0.000    0.422    0.001 {built-in method builtins.len}
        1    0.000    0.000    0.422    0.422 <frozen _collections_abc>:848(__len__)
        1    0.422    0.422    0.422    0.422 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/group.py:491(__len__)
  1000416    0.156    0.000    0.302    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:208(_d)
    10050    0.053    0.000    0.225    0.000 /global/common/software/m4244/DIVER/lib/python3.12/site-packages/h5py/_hl/base.py:278(file)

Upvotes: 0

Views: 53

Answers (1)

kcw78

Reputation: 8006

Your scenario caught my eye. I have not worked with H5 files the size of large.hdf5. However, I've worked with some close to 1 TB (but they did not have 1M groups/datasets). To compare performance, I created 2 simple programs: 1) to create 2 H5 files, and 2) to read the files. The small file has 100K groups and the large one has 1M groups. I did NOT see performance degradation when reading the larger file. (However, my datasets were tiny.)

When I run the program to read the files, it takes approx 0.28 sec to access the dataset shape for each 1,000 groups/datasets, for both small.h5 and large.h5. This time increment is consistent for every group/dataset in each file. One note: it does take longer to read the first 1,000 groups/datasets in large.h5 -- but it was only 4.5 sec. Incremental times dropped to match small.h5 after that. (I used a simple time function to capture timing data - it's sufficient for this test.)

So, clearly there is more to consider than just the number of groups/datasets. Possibilities include:

  • the size of your datasets
  • your dataset attributes (compression, chunking, etc.) -- see the inspection sketch after this list
  • your data structures (e.g. large list manipulation)
  • other overhead in your program
  • interaction with other modules (e.g. tqdm)
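
For the dataset-related items, the relevant properties can be inspected directly on one of the 'eeg' datasets. A minimal sketch (the file and dataset names follow the question; everything else is illustrative):

import h5py

with h5py.File('large.hdf5', 'r') as h5f:
    first = next(iter(h5f))                  # name of the first group
    ds = h5f[first]['eeg']
    print('shape      :', ds.shape)
    print('dtype      :', ds.dtype)
    print('chunks     :', ds.chunks)         # None means contiguous storage
    print('compression:', ds.compression)    # e.g. 'gzip', or None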

I have enclosed my simple programs. Feel free to run and observe their behavior. Then modify to see if you can replicate your behavior. Hopefully it helps you pinpoint the bottleneck.

To create the files:

import numpy as np
import h5py

arr = np.arange(100).astype('float').reshape(20,5)

small_no = 100_000
large_no = 1_000_000

with h5py.File('small.h5','w') as h5f:
    for g_cnt in range(1,small_no+1):
        grp = h5f.create_group(f'group_{g_cnt:06}')
        ds = grp.create_dataset('eeg',data=arr)

with h5py.File('large.h5','w') as h5f:
    for g_cnt in range(1,large_no+1):
        grp = h5f.create_group(f'group_{g_cnt:07}')
        ds = grp.create_dataset('eeg',data=arr)
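
To move the test files closer to the question's scenario, the creation step could be modified to write larger, chunked and/or compressed datasets. A sketch with arbitrary sizes and settings (not the OP's actual data):

import numpy as np
import h5py

# Hypothetical variation: larger, chunked, gzip-compressed datasets.
big_arr = np.random.rand(20, 50_000)

with h5py.File('large_chunked.h5','w') as h5f:
    for g_cnt in range(1, 10_000+1):    # fewer groups to keep the file size manageable
        grp = h5f.create_group(f'group_{g_cnt:07}')
        grp.create_dataset('eeg', data=big_arr, chunks=True, compression='gzip')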

To read the files (change the file name to large.h5 to read the large file):

import time
import h5py

t = time.time()
with h5py.File('small.h5') as h5f:
    for g_cnt, grp in enumerate(h5f):
        ds_size = h5f[grp]['eeg'].shape[1]
        if not ((g_cnt+1) % 1000):
            print(f'Incremental elapsed time at {(g_cnt+1):,}: {(time.time()-t):.2f}')
            t = time.time()

Upvotes: 1
