Reputation: 47

How to extract data from HDF5 in Python?

I have the following HDF5 file. I was able to extract a list, ['model_cints'], from data, but I don't know how to show the data within that list.

https://drive.google.com/drive/folders/1p0J7X4n7A39lHZpCAvv_cw3u-JUZ4WFU?usp=sharing

I've tried converting it with numpy.asarray using this code, but I get these messages:

npa = np.asarray(data, dtype=np.float32)

 
ValueError: could not convert string to float: 'model_cints'


npa = np.asarray(data)

npa
Out[54]: array(['model_cints'], dtype='<U11')

This is the code:

import h5py

filename = "example.hdf5"

with h5py.File(filename, "r") as f:
    # List all groups
    print("Keys: %s" % f.keys())
    a_group_key = list(f.keys())[0]

    # Get the data
    data = list(f[a_group_key])

The data is inside ['model_cints']

Upvotes: 1

Answers (2)

kcw78

Reputation: 8046

If you are new to HDF5, I suggest a "crawl, walk, run" approach to understand the HDF5 data model, your specific data schema, and how to use the various APIs (including h5py and PyTables). HDF5 is designed to be self-describing. In other words, you can figure out the schema by inspection. Understanding the schema is the key to working with your data. Coding before you understand the schema is incredibly frustrating (been there, done that).

I suggest new users start with HDFView from The HDF Group. This is a utility to view the data in a GUI without writing code. And, once you start writing code, it's helpful to visually verify you read the data correctly.

Next, learn how to traverse the data structure. In h5py, you can do this with the visititems() method. I recently wrote a SO Answer with an example. See this answer: SO 65793692: visititems() method to recursively walk nodes
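As a rough illustration of the traversal idea, here is a minimal sketch of visititems(). It does not read the file from the question. It builds a small stand-in file (the name demo.hdf5, the callback name record_node, and the dataset contents are all made up for this example) and assumes only that the real file has a 'data' group containing a 'model_cints' dataset.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file; the layout only mimics the question's file (assumption).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(demo_path, "w") as h5f:
    grp = h5f.create_group("data")
    grp.create_dataset("model_cints", data=np.arange(6, dtype="f8"))

found = []  # (path, kind) pairs collected during the walk


def record_node(name, obj):
    """Callback for visititems(): note whether each node is a Group or a Dataset."""
    kind = "Dataset" if isinstance(obj, h5py.Dataset) else "Group"
    found.append((name, kind))
    # Returning None tells h5py to keep walking the tree.


with h5py.File(demo_path, "r") as h5f:
    h5f.visititems(record_node)

for name, kind in found:
    print(f"{kind}: {name}")
```

The callback receives each node's path relative to the object you called visititems() on, so the paths it reports can be used directly as keys when you index the file.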

In your case, it sounds like you only need to read the data in the dataset at this path: ['data/model_cints'] or ['data']['model_cints']. Both are valid ways to index it. ('data' is a Group and 'model_cints' is a Dataset. Groups are similar to folders/directories and Datasets are like files.)

Once you have a dataset path, you need to get the data type (like NumPy dtype). You get this (and the shape attribute) with h5py the same way you do with NumPy. This is what I get for your dtype:
[('fs_date', '<f8'), ('date', '<f8'), ('prob', 'i1'), ('ymin', '<f8'), ('ymax', '<f8'), ('type', 'O'), ('name', 'O')]

What you have is an array of mixed type: 4 floats, 1 int, and 2 strings. h5py returns this as a NumPy structured array (closely related to a record array, or recarray). This is different from a typical ndarray where all elements are the same type (all ints, or floats, or strings). You access the data with row indices (integers) and/or field names (a single row can also be indexed by column position).
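To make the row/field access concrete without needing the file, here is a small NumPy-only sketch. The dtype is a simplified, hypothetical version of the one above (three of the seven fields), and the values are invented for illustration.

```python
import numpy as np

# Simplified, made-up dtype: three of the seven fields, with invented values.
demo_dtype = np.dtype([("fs_date", "<f8"), ("prob", "i1"), ("name", "O")])
rows = np.array(
    [(2020.0, 50, "a"), (2020.5, 90, "b"), (2021.0, 95, "c")],
    dtype=demo_dtype,
)

first_row = rows[0]           # one record (a numpy.void scalar)
prob_col = rows["prob"]       # one field across all rows, a plain int8 ndarray
first_name = rows[0]["name"]  # row index, then field name
same_name = rows[0][2]        # a single record also accepts a positional index

print(rows.dtype.names)
print(prob_col)
print(first_name, same_name)
```

Field names are usually the safer choice, since positional indices silently change meaning if the field order in the file ever changes.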

I pulled all of this together in the code below. It shows different methods to access the data. (Hopefully the multiple methods don't confuse this explanation.) Each is useful depending on how you want to read the data.

Note: This data looks like results from several tests combined into a single file. If you want to query for particular test values, you should investigate PyTables. It has some powerful search capabilities not available in h5py to simplify that task. Good luck.

import h5py

with h5py.File("example.hdf5", "r") as h5f:
    # Get a h5py dataset object
    data_ds = h5f['data']['model_cints']
    print ('data_ds dtype:', data_ds.dtype, '\nshape:', data_ds.shape)

    # get an array with all fs_date data only
    fs_date_arr = data_ds[:]['fs_date'] 
    print ('fs_date_arr dtype:', fs_date_arr.dtype, '\nshape:', fs_date_arr.shape)

    # Get the entire dataset as 1 numpy record array 
    data_arr_all = h5f['data']['model_cints'][:]
    # this also works:
    data_arr_all = data_ds[:]
    print ('data_arr_all dtype:', data_arr_all.dtype, '\nshape:', data_arr_all.shape)

    # Get the first 6 rows as 1 numpy record array 
    data_arr6 = h5f['data']['model_cints'][0:6][:]
    # this also works:
    data_arr6  = data_ds[0:6][:]
    print ('data_arr6 dtype:', data_arr6.dtype, '\nshape:', data_arr6.shape)

Upvotes: 3

Reti43

Reputation: 9797

f['data'] is a Group object, which means it has keys. When you make an iterable out of it, e.g., list(f['data']), or you iterate it, for something in f['data']:, you're going to get its keys, of which it has one. This explains

>>> np.array(f['data'])
array(['model_cints'], dtype='<U11')

What you want instead is

data = np.array(f['data']['model_cints'])
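The difference can be reproduced on a throwaway file. This is only a sketch: the file name group_demo.hdf5, the two-field dtype, and the values are stand-ins chosen for the example, not the contents of the file in the question.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for example.hdf5: one group holding one compound dataset.
demo_path = os.path.join(tempfile.mkdtemp(), "group_demo.hdf5")
table = np.array([(1.0, 10), (2.0, 20)], dtype=[("date", "<f8"), ("prob", "i1")])
with h5py.File(demo_path, "w") as f:
    f.create_group("data").create_dataset("model_cints", data=table)

with h5py.File(demo_path, "r") as f:
    keys = list(f["data"])                     # iterating a Group yields member names
    data = np.array(f["data"]["model_cints"])  # indexing the Dataset yields its contents

print(keys)
print(data["prob"])
```

Note that the array is created inside the with block, so it stays usable after the file is closed.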

Upvotes: 0
