Patrickens

Reputation: 323

Reading and writing numpy arrays to and from HDF5 files

I am building simulation software, and I need to write thousands of 2D numpy arrays into tables in an HDF5 file, where one dimension of each array is variable. The incoming arrays are of float32 type; to save disk space, every array is stored as a table with appropriate data types for the columns (hence not using HDF5 arrays). When I read the tables back, I'd like to retrieve a numpy.ndarray of type float32 so I can do nice calculations for analysis. Below is example code with an array holding species A, B, and C plus time.

The way I am currently reading and writing 'works', but it is very slow. The question is thus: what is the appropriate way to store arrays into tables quickly, and to read them back again into ndarrays? I have been experimenting with numpy.recarray, but I cannot get it to work (type errors, dimension errors, wholly wrong numbers, etc.).

Code:

import tables as pt
import numpy as np

# Variable dimension
var_dim=100

# Example array, rows 0 and 3 should be stored as float32, rows 1 and 2 as uint16
array=(np.random.random((4, var_dim)) * 100).astype(dtype=np.float32)

filename='test.hdf5'
hdf=pt.open_file(filename=filename,mode='w')
group=hdf.create_group(hdf.root,"group")

particle={
    'A':pt.Float32Col(),
    'B':pt.UInt16Col(),
    'C':pt.UInt16Col(),
    'time':pt.Float32Col(),
    }
# Converters matching the column order (A, B, C, time)
dtypes=[
    np.float32,
    np.uint16,
    np.uint16,
    np.float32
    ]

# This is the table to be stored in
table=hdf.create_table(group,'trajectory', description=particle, expectedrows=var_dim)

# My current way of storing
for i, row in enumerate(array.T):
    table.append([tuple([t(x) for t, x in zip(dtypes, row)])])
table.flush()
hdf.close()


hdf=pt.open_file(filename=filename,mode='r')
array_table=hdf.root.group._f_iter_nodes().__next__()

# My current way of reading
row_list = []
for i, row in enumerate(array_table.read()):
    row_list.append(np.array(list(row)))

# The retrieved array
array=np.asarray(row_list).T


# I've tried something with a recarray
rec_array=array_table.read().view(type=np.recarray)

# This gives errors or wrong results: the mixed-type 12-byte records
# cannot be meaningfully reinterpreted as float64 items
rec_array.view(dtype=np.float64)
hdf.close()

The error I get:

Traceback (most recent call last):
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 475, in __setattr__
    ret = object.__setattr__(self, attr, val)
ValueError: new type not compatible with array.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/thomas/Documents/Thesis/SO.py", line 53, in <module>
    rec_array.view(dtype=np.float64)
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 480, in __setattr__
    raise exctype(value)
ValueError: new type not compatible with array.
Closing remaining open files:test.hdf5...done

Upvotes: 1

Views: 4371

Answers (2)

hpaulj

Reputation: 231665

I haven't worked with tables, but I have looked at its files with h5py. I'm guessing, then, that your array or recarray is a structured array with a dtype like:

In [131]: dt=np.dtype('f4,u2,u2,f4')
In [132]: arr=np.ones(3, dt)
In [133]: arr
Out[133]: 
array([( 1., 1, 1,  1.), ( 1., 1, 1,  1.), ( 1., 1, 1,  1.)], 
      dtype=[('f0', '<f4'), ('f1', '<u2'), ('f2', '<u2'), ('f3', '<f4')])

Using @kazemakase's tolist approach (which I've recommended in other posts):

In [134]: np.array(arr.tolist(), float)
Out[134]: 
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

astype gets the shape all wrong

In [135]: arr.astype(np.float32)
Out[135]: array([ 1.,  1.,  1.], dtype=float32)

view works when component dtypes are uniform, for example with the 2 float fields

In [136]: arr[['f0','f3']].copy().view(np.float32)
Out[136]: array([ 1.,  1.,  1.,  1.,  1.,  1.], dtype=float32)

But it does require a reshape. view reuses the data-buffer bytes, just reinterpreting them.
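For completeness, the missing reshape would look like this (a sketch; on numpy 1.16+ a multi-field index keeps byte padding, so repack_fields from numpy.lib.recfunctions may be needed before the view):

import numpy.lib.recfunctions as rf

# pack the two float32 fields contiguously, reinterpret, then restore 2D shape
packed = rf.repack_fields(arr[['f0', 'f3']].copy())
pairs = packed.view(np.float32).reshape(arr.shape[0], 2)  # one row per record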

Many numpy.lib.recfunctions functions use a field-by-field copy. Here the equivalent would be

In [138]: res = np.empty((3,4),'float32')
In [139]: for i in range(4):
     ...:     res[:,i] = arr[arr.dtype.names[i]]
     ...:     
In [140]: res
Out[140]: 
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]], dtype=float32)

If the number of fields is small compared to the number of records, this iteration is not expensive.


def foo(arr):
    # copy each field of the structured array into a column of a 2D float32 array
    res = np.empty((arr.shape[0], 4), np.float32)
    for i in range(4):
        res[:, i] = arr[arr.dtype.names[i]]
    return res

With a large 4 field array, the by-field copy is clearly faster:

In [143]: arr = np.ones(10000, dtype=dt)
In [149]: timeit x1 = foo(arr)
10000 loops, best of 3: 73.5 µs per loop
In [150]: timeit x2 = np.array(arr.tolist(), np.float32)
100 loops, best of 3: 11.9 ms per loop
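For reference, newer numpy (1.16+) wraps this same field-by-field copy in numpy.lib.recfunctions.structured_to_unstructured; a minimal sketch, assuming that version is available:

from numpy.lib import recfunctions as rf

# field-by-field copy into a plain (n_records, n_fields) float32 array
unstructured = rf.structured_to_unstructured(arr, dtype=np.float32)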

Upvotes: 1

MB-F

Reputation: 23647

As a quick and dirty solution, it is possible to avoid loops by temporarily converting the arrays to lists (if you can spare the memory). For some reason, record arrays are readily converted to/from lists but not to/from conventional arrays.

Storing:

table.append(array.T.tolist())

Loading:

loaded_array = np.array(array_table.read().tolist(), dtype=np.float64).T

There should be a more "Numpythonic" approach to convert between record arrays and conventional arrays, but I'm not familiar enough with the former to know how.
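One candidate, on numpy 1.16 or newer, is the unstructured_to_structured/structured_to_unstructured pair in numpy.lib.recfunctions; a sketch against the question's names (the explicit dtype below is an assumption matching the table's alphabetical column order A, B, C, time):

from numpy.lib import recfunctions as rf

# assumed to match the table's column order and types
dt = np.dtype([('A', 'f4'), ('B', 'u2'), ('C', 'u2'), ('time', 'f4')])

# storing: build structured records directly instead of going through lists
table.append(rf.unstructured_to_structured(array.T, dtype=dt))

# loading: back to a plain float array without the tolist detour
loaded_array = rf.structured_to_unstructured(array_table.read(), dtype=np.float64).T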

Upvotes: 2
