Patrickens

Reputation: 323

Reading and writing numpy arrays to and from HDF5 files

I am building simulation software, and I need to write thousands of 2D numpy arrays into tables in an HDF5 file, where one dimension of each array is variable. The incoming arrays are of float32 type; to save disk space, every array is stored as a table with appropriate data types for the columns (hence not using HDF5 arrays). When I read the tables back, I'd like to retrieve a numpy.ndarray of type float32 so I can do nice calculations for analysis. Below is example code with an array holding species A, B, and C plus time.

The way I am currently reading and writing 'works', but it is very slow. The question is thus: what is the appropriate way to store arrays into tables quickly, and to read them back again into ndarrays? I have been experimenting with numpy.recarray, but I cannot get it to work (type errors, dimension errors, wholly wrong numbers, etc.).

Code:

import tables as pt
import numpy as np

# Variable dimension
var_dim=100

# Example array, rows 0 and 3 should be stored as float32, rows 1 and 2 as uint16
array=(np.random.random((4, var_dim)) * 100).astype(dtype=np.float32)

filename='test.hdf5'
hdf=pt.open_file(filename=filename,mode='w')
group=hdf.create_group(hdf.root,"group")

particle={
    'A':pt.Float32Col(),
    'B':pt.UInt16Col(),
    'C':pt.UInt16Col(),
    'time':pt.Float32Col(),
    }
# Converters matching the column order (A, B, C, time)
dtypes=[
    np.float32,
    np.uint16,
    np.uint16,
    np.float32
    ]

# This is the table to be stored in
table=hdf.create_table(group,'trajectory', description=particle, expectedrows=var_dim)

# My current way of storing
for i, row in enumerate(array.T):
    table.append([tuple([t(x) for t, x in zip(dtypes, row)])])
table.flush()
hdf.close()


hdf=pt.open_file(filename=filename,mode='r')
array_table=hdf.root.group._f_iter_nodes().__next__()

# My current way of reading
row_list = []
for i, row in enumerate(array_table.read()):
    row_list.append(np.array(list(row)))

# The retrieved array
array=np.asarray(row_list).T


# I've tried something with a recarray
rec_array=array_table.read().view(type=np.recarray)

# This gives errors or wrong results: the mixed-type 12-byte records
# cannot be meaningfully reinterpreted as float64 items
rec_array.view(dtype=np.float64)
hdf.close()

The error I get:

Traceback (most recent call last):
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 475, in __setattr__
    ret = object.__setattr__(self, attr, val)
ValueError: new type not compatible with array.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/thomas/Documents/Thesis/SO.py", line 53, in <module>
    rec_array.view(dtype=np.float64)
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 480, in __setattr__
    raise exctype(value)
ValueError: new type not compatible with array.
Closing remaining open files:test.hdf5...done

Upvotes: 1

Views: 4371

Answers (2)

hpaulj

Reputation: 231665

I haven't worked with tables, but I have looked at its files with h5py. I'm guessing, then, that your array or recarray is a structured array with a dtype like:

In [131]: dt=np.dtype('f4,u2,u2,f4')
In [132]: arr=np.ones(3, dt)
In [133]: arr
Out[133]: 
array([( 1., 1, 1,  1.), ( 1., 1, 1,  1.), ( 1., 1, 1,  1.)], 
      dtype=[('f0', '<f4'), ('f1', '<u2'), ('f2', '<u2'), ('f3', '<f4')])

Using @kazemakase's tolist approach (which I've recommended in other posts):

In [134]: np.array(arr.tolist(), float)
Out[134]: 
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

astype gets the shape all wrong

In [135]: arr.astype(np.float32)
Out[135]: array([ 1.,  1.,  1.], dtype=float32)

view works when component dtypes are uniform, for example with the 2 float fields

In [136]: arr[['f0','f3']].copy().view(np.float32)
Out[136]: array([ 1.,  1.,  1.,  1.,  1.,  1.], dtype=float32)

But it does require a reshape. view reuses the data-buffer bytes, just reinterpreting them.
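For completeness, the missing reshape would look like this (a sketch; on numpy 1.16+ a multi-field index keeps byte padding, so repack_fields from numpy.lib.recfunctions may be needed before the view):

import numpy.lib.recfunctions as rf

# pack the two float32 fields contiguously, reinterpret, then restore 2D shape
packed = rf.repack_fields(arr[['f0', 'f3']].copy())
pairs = packed.view(np.float32).reshape(arr.shape[0], 2)  # one row per record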

Many numpy.lib.recfunctions functions use a field-by-field copy. Here the equivalent would be

In [138]: res = np.empty((3,4),'float32')
In [139]: for i in range(4):
     ...:     res[:,i] = arr[arr.dtype.names[i]]
     ...:     
In [140]: res
Out[140]: 
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]], dtype=float32)

If the number of fields is small compared to the number of records, this iteration is not expensive.


def foo(arr):
    # copy each field of the structured array into a column of a 2D float32 array
    res = np.empty((arr.shape[0], 4), np.float32)
    for i in range(4):
        res[:, i] = arr[arr.dtype.names[i]]
    return res

With a large 4 field array, the by-field copy is clearly faster:

In [143]: arr = np.ones(10000, dtype=dt)
In [149]: timeit x1 = foo(arr)
10000 loops, best of 3: 73.5 µs per loop
In [150]: timeit x2 = np.array(arr.tolist(), np.float32)
100 loops, best of 3: 11.9 ms per loop
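For reference, newer numpy (1.16+) wraps this same field-by-field copy in numpy.lib.recfunctions.structured_to_unstructured; a minimal sketch, assuming that version is available:

from numpy.lib import recfunctions as rf

# field-by-field copy into a plain (n_records, n_fields) float32 array
unstructured = rf.structured_to_unstructured(arr, dtype=np.float32)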

Upvotes: 1

MB-F

Reputation: 23647

As a quick and dirty solution, it is possible to avoid loops by temporarily converting the arrays to lists (if you can spare the memory). For some reason, record arrays are readily converted to/from lists but not to/from conventional arrays.

Storing:

table.append(array.T.tolist())

Loading:

loaded_array = np.array(array_table.read().tolist(), dtype=np.float64).T

There should be a more "Numpythonic" approach to convert between record arrays and conventional arrays, but I'm not familiar enough with the former to know how.
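One candidate, on numpy 1.16 or newer, is the unstructured_to_structured/structured_to_unstructured pair in numpy.lib.recfunctions; a sketch against the question's names (the explicit dtype below is an assumption matching the table's alphabetical column order A, B, C, time):

from numpy.lib import recfunctions as rf

# assumed to match the table's column order and types
dt = np.dtype([('A', 'f4'), ('B', 'u2'), ('C', 'u2'), ('time', 'f4')])

# storing: build structured records directly instead of going through lists
table.append(rf.unstructured_to_structured(array.T, dtype=dt))

# loading: back to a plain float array without the tolist detour
loaded_array = rf.structured_to_unstructured(array_table.read(), dtype=np.float64).T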

Upvotes: 2
