Balfar

Reputation: 135

What is the most efficient way to find all paths to particular values in HDF5 file with Python?

I am looking for negative values in a .hdf5 file that has the following architecture:

- Incidence_0
   - Wavelength_0
      - (Table of size m * n)
   - Wavelength_1
      - (Table of size m * n)
   ...
- Incidence_1
   ...
...

My objective is to find every negative value, and to get back its exact position in the file (ie, the number of the incidence, the number of the wavelength, and its position in the associated table).

I am sorry that I cannot give a minimal reproducible example, since I cannot share the file that I'm using, but here is the idea.

import h5py

file = h5py.File('testFile.hdf5', 'r')

result = []

# nbIncidence, nbWavelength, nbTheta, nbPhi are assumed to be known in advance
for incidence in range(nbIncidence):
    for wavelength in range(nbWavelength):
        for theta in range(nbTheta):
            for phi in range(nbPhi):
                value = file['Incidence_' + str(incidence)]['Wavelength_' + str(wavelength)][theta, phi]

                if value < 0:
                    result.append([value, incidence, wavelength, theta, phi])

This works perfectly, but using four nested loops is time-consuming, especially if I have to work on huge files, which may well happen... I don't know the h5py library well enough, but I am pretty sure there is a way to do this much faster.

Upvotes: 0

Views: 634

Answers (1)

kcw78

Reputation: 8046

First, the bad news: h5py doesn't have a function to interrogate your data in the way you described. The good news: you can accomplish your task by extracting each Incidence/Wavelength dataset to a NumPy array, then combining two NumPy methods to operate on the extracted array. [Note: This assumes you have sufficient memory to load each dataset.]

Some observations on working with this data (to help you follow my example).

  1. The HDF5 file schema is self-describing. So you don't need to iterate over integer counters. Instead, you can get group and dataset names with the .keys() method. (Or, you can get (name, object) tuples with the .items() method.)
  2. I highly recommend using Python's file context manager to be sure you don't leave the file open when you exit.
  3. Read data from the dataset into a numpy array with standard numpy slicing notation. An empty tuple retrieves all data. Use the following with your schema: arr = file['Incidence_#']['Wavelength_#'][()]
  4. Create a new boolean array based on your specified criteria. Using arr < 0 will return True for all negative values (and False for other values).
  5. Use np.argwhere to find array element indices that are non-zero. Use this on the boolean array (remembering True is non-zero, and False is zero).
  6. From there, you can loop over the indices to extract the data you want. :-)
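Steps 4 and 5 above can be demonstrated on a tiny stand-in array (the values here are made up purely for illustration):

```python
import numpy as np

# Tiny stand-in for one Wavelength dataset (values are illustrative only)
arr = np.array([[ 1.0, -2.0,  3.0],
                [-4.0,  5.0, -6.0]])

mask = arr < 0               # boolean array: True where a value is negative
neg_idx = np.argwhere(mask)  # (row, col) index of every True entry
neg_vals = arr[mask]         # the negative values themselves, in row-major order

print(neg_idx)   # [[0 1] [1 0] [1 2]]
print(neg_vals)  # [-2. -4. -6.]
```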

I created a simple example that mimics your schema to demonstrate the process. For completeness, that code is at the end. It's a small file that won't bury you in output.

Code below reads the data, finds negative values, and adds data to a list. It has several print statements so you can see how each step works. They aren't needed once you are confident in the procedure.

import h5py
import numpy as np

with h5py.File('testFile.hdf5', 'r') as h5fr:
    result = []

    for i_grp in h5fr.keys():
        for wave_ds in h5fr[i_grp].keys():
            wave_arr = h5fr[i_grp][wave_ds][()]    # load the dataset as a NumPy array
            neg_idx = np.argwhere(wave_arr < 0.0)  # indices of all negative values
            wave_res = []
            for n in neg_idx:
                i, j = n[0], n[1]
                result.append([wave_arr[i, j], i_grp, wave_ds, i, j])
                wave_res.append([wave_arr[i, j], i_grp, wave_ds, i, j])
            print(f'\nResults for {i_grp}; {wave_ds}:')
            print(wave_res)
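If the per-index loop itself ever becomes a bottleneck, NumPy fancy indexing can pull all negative values out of an array in one operation. A minimal sketch, using a made-up array in place of `wave_arr`:

```python
import numpy as np

# Made-up stand-in for one extracted Wavelength array
wave_arr = np.array([[0.5, -1.5],
                     [-2.5, 3.5]])

neg_idx = np.argwhere(wave_arr < 0.0)
# Fancy indexing with the row and column index vectors extracts
# every negative value at once, with no Python-level loop
neg_vals = wave_arr[neg_idx[:, 0], neg_idx[:, 1]]

print(neg_vals)  # [-1.5 -2.5]
```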

Code to create the example file used above:

import h5py
import numpy as np

nbIncidence = 4
nbWavelength = 6
m, n = 10, 10  # m, n play the role of nbTheta, nbPhi

with h5py.File('testFile.hdf5', 'w') as h5fw:
    for i_cnt in range(nbIncidence):
        grp = h5fw.create_group('Incidence_' + str(i_cnt))
        for w_cnt in range(nbWavelength):
            arr = np.random.uniform(low=-1.0, high=10.0, size=(m, n))
            grp.create_dataset('Wavelength_' + str(w_cnt), data=arr)

Upvotes: 1
