Reputation: 323
I am building simulation software, and I need to write (thousands of) 2D numpy arrays into tables in an HDF5 file, where one dimension of the array is variable. The incoming array is of float32 type; to save disk space, every array is stored as a table with appropriate data types for the columns (hence not using arrays). When I read tables, I'd like to retrieve a numpy.ndarray of type float32, so I can do nice calculations for analysis. Below is example code with an array with species A, B, and C plus time.
The way I am currently reading and writing 'works', but it is very slow. The question is thus: what is the appropriate way of storing array into table fast, and also reading it back again into ndarrays? I have been experimenting with numpy.recarray, but I cannot get this to work (type errors, dimension errors, wholly wrong numbers, etc.).
Code:
import tables as pt
import numpy as np

# Variable dimension
var_dim = 100
# Example array; rows 0 and 3 should be stored as float32, rows 1 and 2 as uint16
array = (np.random.random((4, var_dim)) * 100).astype(dtype=np.float32)

filename = 'test.hdf5'
hdf = pt.open_file(filename=filename, mode='w')
group = hdf.create_group(hdf.root, "group")
particle = {
    'A': pt.Float32Col(),
    'B': pt.UInt16Col(),
    'C': pt.UInt16Col(),
    'time': pt.Float32Col(),
}
dtypes = np.array([
    np.float32,
    np.uint16,
    np.uint16,
    np.float32
])

# This is the table to be stored in
table = hdf.create_table(group, 'trajectory', description=particle, expectedrows=var_dim)

# My current way of storing
for i, row in enumerate(array.T):
    table.append([tuple([t(x) for t, x in zip(dtypes, row)])])
table.flush()
hdf.close()
hdf = pt.open_file(filename=filename, mode='r')
array_table = hdf.root.group._f_iter_nodes().__next__()

# My current way of reading
row_list = []
for i, row in enumerate(array_table.read()):
    row_list.append(np.array(list(row)))
# The retrieved array
array = np.asarray(row_list).T

# I've tried something with a recarray
rec_array = array_table.read().view(type=np.recarray)
# This gives me errors, or wrong results
rec_array.view(dtype=np.float64)
hdf.close()
The error I get:
Traceback (most recent call last):
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 475, in __setattr__
    ret = object.__setattr__(self, attr, val)
ValueError: new type not compatible with array.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/thomas/Documents/Thesis/SO.py", line 53, in <module>
    rec_array.view(dtype=np.float64)
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 480, in __setattr__
    raise exctype(value)
ValueError: new type not compatible with array.
Closing remaining open files:test.hdf5...done
Upvotes: 1
Views: 4371
Reputation: 231665
I haven't worked with tables, but I have looked at its files with h5py. I'm guessing, then, that your array or recarray is a structured array with a dtype like:
In [131]: dt = np.dtype('f4,u2,u2,f4')
In [132]: arr = np.ones((3,), dtype=dt)
In [133]: arr
Out[133]:
array([( 1., 1, 1, 1.), ( 1., 1, 1, 1.), ( 1., 1, 1, 1.)],
      dtype=[('f0', '<f4'), ('f1', '<u2'), ('f2', '<u2'), ('f3', '<f4')])
Using @kazemakase's tolist approach (which I've recommended in other posts):
In [134]: np.array(arr.tolist(), float)
Out[134]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
astype gets the shape all wrong:
In [135]: arr.astype(np.float32)
Out[135]: array([ 1., 1., 1.], dtype=float32)
view works when the component dtypes are uniform, for example with the two float fields:
In [136]: arr[['f0','f3']].copy().view(np.float32)
Out[136]: array([ 1.,  1.,  1.,  1.,  1.,  1.], dtype=float32)
But it does require a reshape; view uses the data-buffer bytes, just reinterpreting them.
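As a minimal sketch of that reshape step (assuming the 3-record arr from above, and the packing behaviour the copy shows here):
# the flat float32 view holds 2 values per record;
# reshape to one row per record, one column per selected field
flat = arr[['f0', 'f3']].copy().view(np.float32)
pairs = flat.reshape(len(arr), 2)   # shape (3, 2)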
Many recfunctions functions use a field-by-field copy. Here the equivalent would be:
In [138]: res = np.empty((3,4),'float32')
In [139]: for i in range(4):
     ...:     res[:,i] = arr[arr.dtype.names[i]]
     ...:
In [140]: res
Out[140]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]], dtype=float32)
If the number of fields is small compared to the number of records, this iteration is not expensive.
def foo(arr):
    res = np.empty((arr.shape[0], 4), np.float32)
    for i in range(4):
        res[:, i] = arr[arr.dtype.names[i]]
    return res
With a large 4-field array, the by-field copy is clearly faster:
In [143]: arr = np.ones(10000, dtype=dt)
In [149]: timeit x1 = foo(arr)
10000 loops, best of 3: 73.5 µs per loop
In [150]: timeit x2 = np.array(arr.tolist(), np.float32)
100 loops, best of 3: 11.9 ms per loop
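For what it's worth, NumPy 1.16 and later ship a recfunctions helper that does this field-by-field copy (including the cast) in one call; a minimal sketch, assuming a new enough NumPy:
from numpy.lib.recfunctions import structured_to_unstructured

# copies each field into a column of a plain 2D float32 array
x3 = structured_to_unstructured(arr, dtype=np.float32)   # shape (10000, 4)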
Upvotes: 1
Reputation: 23647
As a quick and dirty solution, it is possible to avoid loops by temporarily converting the arrays to lists (if you can spare the memory). For some reason record arrays are readily converted to/from lists but not to/from conventional arrays.
Storing:
table.append(array.T.tolist())
Loading:
loaded_array = np.array(array_table.read().tolist(), dtype=np.float64).T
There should be a more "Numpythonic" approach to convert between record arrays and conventional arrays, but I'm not familiar enough with the former to know how.
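For context, here is a minimal sketch of how those two lines slot into the question's script (same names as the question; the only change is casting to float32, as the question asks, rather than float64). Note this relies on the table's column order, alphabetical here (A, B, C, time), matching the row order of array:
# storing: one append call for all var_dim rows instead of a Python loop
table.append(array.T.tolist())
table.flush()

# loading: tolist() unpacks the structured rows; np.array re-packs and casts them
loaded_array = np.array(array_table.read().tolist(), dtype=np.float32).T   # shape (4, var_dim)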
Upvotes: 2