Reputation: 420
I have two unstructured NumPy arrays a and b with shapes (N,) and (N, 256, 2) respectively and dtype np.float. I wish to combine these into a single structured array with shape (N,) and dtype [('field1', np.float), ('field2', np.float, (256, 2))].
The documentation on this is surprisingly lacking. I've found functions like np.lib.recfunctions.merge_arrays but have not been able to find the precise combination of features required to do this.
For the sake of avoiding the XY problem, I'll state my wider aims.
I have a PyTables table with layout {"field1": tables.FloatCol(), "field2": tables.FloatCol(shape = (256, 2))}. The two NumPy arrays represent N new rows to be appended to each of these fields. N is large, so I wish to do this with a single efficient table.append(rows) call, rather than the slow process of looping through table.row['field'] = ... for every row.
The table.append documentation says:
The rows argument may be any object which can be converted to a structured array compliant with the table structure (otherwise, a ValueError is raised). This includes NumPy structured arrays, lists of tuples or array records, and a string or Python buffer.
Converting my arrays to an appropriate structured array seems to be what I should be doing here. I'm looking for speed, and I anticipate the other options being slower.
Upvotes: 0
Views: 1519
Reputation: 8046
This answer builds on @hpaulj's answer. His first method creates the obj argument as a structured array and his second creates a record array. (This array would be the rows argument when you append.) I like both of these methods for creating or appending to tables when I already have my data in a structured (or record) array. However, you don't have to do this if your data is in separate arrays (as stated under "avoiding the XY problem"). As noted in the PyTables doc for table.append():
The rows argument may be any object which can be converted to a structured array compliant with the table structure.... This includes NumPy structured arrays, lists of tuples or array records...
In other words, you can append with a list referencing your arrays, so long as they match the table structure created with description=dt in the example. (I think you are limited to structured arrays at creation.) This might simplify your code.
I wrote an example that builds on @hpaulj's code. It creates 2 identical HDF5 files with different methods.
- For _1.h5: I create the table using the structured array method. I then add 3 rows of data to the table using table.append([list of arrays]).
- For _2.h5: I create the table referencing the structured array dtype using description=dt, but do not add data with obj=arr. I then add the first 3 rows of data to the table using table.append([list of arrays]) and repeat to add 3 more rows.
Example below:
import numpy as np
import tables as tb
# np.float64 replaces the np.float alias, which recent NumPy releases no longer provide
dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
arr = np.zeros(3, dt) # float display is prettier
arr['field1'] = np.arange(3)
arr['field2'] = np.arange(24).reshape(3,4,2)
with tb.File('SO_62104084_1.h5','w') as h5f1:
    # create the table from the structured array (rows 0-2 come from arr):
    test_tb = h5f1.create_table('/','test',obj=arr)
    arr1 = np.arange(13.,16.,1.)
    arr2 = np.arange(124.,148.,1.).reshape(3,4,2)
    # add rows of data referencing list of arrays:
    test_tb.append([arr1,arr2])
with tb.File('SO_62104084_2.h5','w') as h5f2:
    # create an empty table from the dtype only:
    test_tb=h5f2.create_table('/','test', description=dt)
    # add data rows 0-2:
    arr1 = np.arange(3)
    arr2 = np.arange(24).reshape(3,4,2)
    test_tb.append([arr1,arr2])
    # add data rows 3-5:
    arr1 = np.arange(13.,16.,1.)
    arr2 = np.arange(124.,148.,1.).reshape(3,4,2)
    test_tb.append([arr1,arr2])
Upvotes: 0
Reputation: 31011
In order to have test printouts of decent size, my solution assumes N = 5 and b with shape (5, 4, 2) instead of (N, 256, 2).
To generate the result, proceed as follows:
Start with import numpy as np and import numpy.lib.recfunctions as rfn (the latter will be needed soon).
Create source arrays:
a = np.array([10, 20, 30, 40, 50])
b = np.arange(1, 41).reshape(5, 4, 2)
Create the result:
result = rfn.unstructured_to_structured(
    np.hstack((a[:,np.newaxis], b.reshape(-1,8))),
    np.dtype([('field1', 'f4'), ('field2', 'f4', (4,2))]))
The generated array contains:
array([(10., [[ 1., 2.], [ 3., 4.], [ 5., 6.], [ 7., 8.]]),
(20., [[ 9., 10.], [11., 12.], [13., 14.], [15., 16.]]),
(30., [[17., 18.], [19., 20.], [21., 22.], [23., 24.]]),
(40., [[25., 26.], [27., 28.], [29., 30.], [31., 32.]]),
(50., [[33., 34.], [35., 36.], [37., 38.], [39., 40.]])],
dtype=[('field1', '<f4'), ('field2', '<f4', (4, 2))])
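Note that the source array passed to unstructured_to_structured is created the following way: a[:,np.newaxis] turns a into a column of shape (5, 1), b.reshape(-1,8) flattens each 4 x 2 block of b into a row of 8 values, and np.hstack joins the two into a single (5, 9) array.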
During the above experiments I assumed a type of f4. Since np.float in your question is a 64-bit float, you may want to change it to f8 (your decision).
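In the target version of the code: reshape b with b.reshape(-1, 512), since 256 * 2 = 512, and use (256, 2) as the shape of field2 in the dtype.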
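A minimal sketch of the full-size version, assuming f8 and the (N, 256, 2) shape from the question. N = 100 and the random placeholder data are assumptions chosen only to keep the example small.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Placeholder inputs (assumed for illustration); real data would replace these.
N = 100
rng = np.random.default_rng(0)
a = rng.random(N)
b = rng.random((N, 256, 2))

target = np.dtype([('field1', 'f8'), ('field2', 'f8', (256, 2))])

# One row per record: 1 value for field1 followed by 256 * 2 = 512 for field2.
flat = np.hstack((a[:, np.newaxis], b.reshape(N, -1)))
result = rfn.unstructured_to_structured(flat, target)
print(result.shape, result['field2'].shape)
```

Note that np.hstack builds an intermediate (N, 513) copy before the structured array is created, which is worth keeping in mind when N is large.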
Upvotes: 0
Reputation: 231530
Define the dtype, and create an empty/zeros array:
In [163]: dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
In [164]: arr = np.zeros(3, dt) # float display is prettier
In [165]: arr
Out[165]:
array([(0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]]),
(0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]]),
(0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]])],
dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])
Assign values field by field:
In [166]: arr['field1'] = np.arange(3)
In [167]: arr['field2'].shape
Out[167]: (3, 4, 2)
In [168]: arr['field2'] = np.arange(24).reshape(3,4,2)
In [169]: arr
Out[169]:
array([(0., [[ 0., 1.], [ 2., 3.], [ 4., 5.], [ 6., 7.]]),
(1., [[ 8., 9.], [10., 11.], [12., 13.], [14., 15.]]),
(2., [[16., 17.], [18., 19.], [20., 21.], [22., 23.]])],
dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])
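The same by-field assignment scales directly to the shapes in the question. The following is a minimal sketch rather than a definitive implementation: the helper name combine_fields, the value N = 1000 and the random placeholder data are assumptions made for illustration.

```python
import numpy as np

# Target dtype from the question; np.float64 stands in for the old np.float alias.
dt = np.dtype([('field1', np.float64), ('field2', np.float64, (256, 2))])

def combine_fields(a, b, dtype=dt):
    """Hypothetical helper: pack a (N,) and b (N, 256, 2) into one structured array."""
    if a.shape[0] != b.shape[0]:
        raise ValueError("a and b must have the same number of rows")
    out = np.empty(a.shape[0], dtype=dtype)  # every field is assigned below
    out['field1'] = a
    out['field2'] = b
    return out

# Placeholder inputs (assumed); real data would replace these.
N = 1000
rng = np.random.default_rng(0)
a = rng.random(N)
b = rng.random((N, 256, 2))

rows = combine_fields(a, b)
print(rows.shape, rows.dtype)
```

Each field assignment is a single vectorized copy, so there is no Python-level loop over rows. The resulting rows array has shape (N,) and should be suitable as the rows argument of table.append.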
np.rec does have a function that works similarly:
In [174]: np.rec.fromarrays([np.arange(3.), np.arange(24).reshape(3,4,2)], dtype=dt)
Out[174]:
rec.array([(0., [[ 0., 1.], [ 2., 3.], [ 4., 5.], [ 6., 7.]]),
(1., [[ 8., 9.], [10., 11.], [12., 13.], [14., 15.]]),
(2., [[16., 17.], [18., 19.], [20., 21.], [22., 23.]])],
dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])
This is the same, except fields can be accessed as attributes (as well). Under the covers it does the same by-field assignment.
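As a quick check of that equivalence, the sketch below builds the same data both ways and compares the results field by field. The variable names manual, rec and same are illustrative only.

```python
import numpy as np

dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
f1 = np.arange(3.)
f2 = np.arange(24).reshape(3, 4, 2)

# Manual by-field assignment into a plain structured array.
manual = np.zeros(3, dt)
manual['field1'] = f1
manual['field2'] = f2

# The np.rec helper, given the same arrays and dtype.
rec = np.rec.fromarrays([f1, f2], dtype=dt)

# Compare every field; the record array also exposes fields as attributes.
same = all(np.array_equal(manual[name], rec[name]) for name in dt.names)
print(same)
print(rec.field1)
```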
numpy.lib.recfunctions is another collection of structured array functions. These too mostly follow the by-field assignment approach.
Upvotes: 1