Sam

Reputation: 420

Combine two NumPy arrays into one structured array for appending to a PyTables table

I have two unstructured NumPy arrays a and b with shapes (N,) and (N, 256, 2) respectively and dtype np.float64. I wish to combine these into a single structured array with shape (N,) and dtype [('field1', np.float64), ('field2', np.float64, (256, 2))].

The documentation on this is surprisingly lacking. I've found methods like np.lib.recfunctions.merge_arrays but have not been able to find the precise combination of features required to do this.


For the sake of avoiding the XY problem, I'll state my wider aims.

I have a PyTables table with layout {"field1": tables.FloatCol(), "field2": tables.FloatCol(shape = (256, 2))}. The two NumPy arrays represent N new rows to be appended to each of these fields. N is large, so I wish to do this with a single efficient table.append(rows) call, rather than the slow process of looping through table.row['field'] = ....

The table.append documentation says

The rows argument may be any object which can be converted to a structured array compliant with the table structure (otherwise, a ValueError is raised). This includes NumPy structured arrays, lists of tuples or array records, and a string or Python buffer.

Converting my arrays to an appropriate structured array seems to be what I should be doing here. I'm looking for speed, and I anticipate the other options being slower.

Upvotes: 0

Views: 1519

Answers (3)

kcw78

Reputation: 8046

This answer builds on @hpaulj's answer. His first method creates the obj argument as a structured array and his second creates a record array. (This array would be the rows argument when you append.) I like both of these methods for creating or appending to tables when my data is already in a structured (or record) array. However, you don't have to do this if your data is in separate arrays (as stated under "avoiding the XY problem"). As noted in the PyTables doc for table.append():

The rows argument may be any object which can be converted to a structured array compliant with the table structure.... This includes NumPy structured arrays, lists of tuples or array records...

In other words, you can append with a list referencing your arrays, so long as they match the table structure created with description=dt in the example. (I think you are limited to structured arrays at creation.) This might simplify your code.

I wrote an example that builds on @hpaulj's code. It creates 2 identical HDF5 files with different methods.

  • For the first file (_1.h5) I create the table using the structured array method. I then add 3 rows of data to the table using table.append([list of arrays])
  • For the second file (_2.h5) I create the table referencing the structured array dtype using description=dt, but do not add data with obj=arr. I then add the first 3 rows of data to the table using table.append([list of arrays]) and repeat to add 3 more rows.

Example below:

import numpy as np
import tables as tb

dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
arr = np.zeros(3, dt)     # float display is prettier
arr['field1'] = np.arange(3)
arr['field2'] = np.arange(24).reshape(3,4,2)

with tb.open_file('SO_62104084_1.h5','w') as h5f1:
    # create the table from the structured array (rows 0-2)
    test_tb = h5f1.create_table('/','test',obj=arr)
    arr1 = np.arange(13.,16.,1.)
    arr2 = np.arange(124.,148.,1.).reshape(3,4,2)
    # add rows of data referencing list of arrays:
    test_tb.append([arr1,arr2])

with tb.open_file('SO_62104084_2.h5','w') as h5f2:
    # create an empty table from the dtype only
    test_tb = h5f2.create_table('/','test', description=dt)
    # add data rows 0-2:
    arr1 = np.arange(3)
    arr2 = np.arange(24).reshape(3,4,2)
    test_tb.append([arr1,arr2])
    # add data rows 3-5:
    arr1 = np.arange(13.,16.,1.)
    arr2 = np.arange(124.,148.,1.).reshape(3,4,2)
    test_tb.append([arr1,arr2])

Upvotes: 0

Valdi_Bo

Reputation: 31011

In order to have test printouts of decent size, my solution assumes:

  • N = 5,
  • the second dimension - only 4 (instead of your 256).

To generate the result, proceed as follows:

  1. Start with import numpy as np and import numpy.lib.recfunctions as rfn (the latter will be needed soon).

  2. Create source arrays:

    a = np.array([10, 20, 30, 40, 50])
    b = np.arange(1, 41).reshape(5, 4, 2)
    
  3. Create the result:

    result = rfn.unstructured_to_structured(
        np.hstack((a[:,np.newaxis], b.reshape(-1,8))),
        np.dtype([('field1', 'f4'), ('field2', 'f4', (4,2))]))
    

The generated array contains:

array([(10., [[ 1.,  2.], [ 3.,  4.], [ 5.,  6.], [ 7.,  8.]]),
       (20., [[ 9., 10.], [11., 12.], [13., 14.], [15., 16.]]),
       (30., [[17., 18.], [19., 20.], [21., 22.], [23., 24.]]),
       (40., [[25., 26.], [27., 28.], [29., 30.], [31., 32.]]),
       (50., [[33., 34.], [35., 36.], [37., 38.], [39., 40.]])],
      dtype=[('field1', '<f4'), ('field2', '<f4', (4, 2))])

Note that the source array to unstructured_to_structured is created the following way:

  • Column 0 - from a (converted to a column),
  • Remaining columns - from b, reshaped so that all elements of each 4 * 2 slice form a single row. unstructured_to_structured converts the data in these columns back to the "4 * 2" shape for each row.
  • Both the above components are assembled with hstack.

In the above experiments I used f4. Since your arrays are 64-bit floats, you may want to change it to f8 (your decision).

In the target version of the code:

  • change 4 in the first dimension of field2 to 256,
  • change 8 in b.reshape to 512 (= 2 * 256).
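
As a rough sketch of those two changes, assuming float64 data and a small N chosen only for illustration (the names flat and dt are placeholders, not from the question), the full-size version might look like this:

```python
import numpy as np
import numpy.lib.recfunctions as rfn

N = 5  # small for illustration; the question's N is large
a = np.arange(N, dtype=np.float64)                               # shape (N,)
b = np.arange(N * 256 * 2, dtype=np.float64).reshape(N, 256, 2)  # shape (N, 256, 2)

# Target dtype from the question, with f8 instead of f4
dt = np.dtype([('field1', 'f8'), ('field2', 'f8', (256, 2))])

# One row per record: column 0 from a, the remaining 512 columns from b
flat = np.hstack((a[:, np.newaxis], b.reshape(N, -1)))           # shape (N, 513)

result = rfn.unstructured_to_structured(flat, dt)                # shape (N,)
```

Note that np.hstack builds an intermediate (N, 513) copy before the conversion, which may matter when N is large.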

Upvotes: 0

hpaulj

Reputation: 231530

Define the dtype, and create an empty/zeros array:

In [163]: dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
In [164]: arr = np.zeros(3, dt)     # float display is prettier                                                          
In [165]: arr                                                                            
Out[165]: 
array([(0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]]),
       (0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]]),
       (0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]])],
      dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])

Assign values field by field:

In [166]: arr['field1'] = np.arange(3)                                                   
In [167]: arr['field2'].shape                                                            
Out[167]: (3, 4, 2)
In [168]: arr['field2'] = np.arange(24).reshape(3,4,2)                                   
In [169]: arr                                                                            
Out[169]: 
array([(0., [[ 0.,  1.], [ 2.,  3.], [ 4.,  5.], [ 6.,  7.]]),
       (1., [[ 8.,  9.], [10., 11.], [12., 13.], [14., 15.]]),
       (2., [[16., 17.], [18., 19.], [20., 21.], [22., 23.]])],
      dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])

np.rec does have a function that works similarly:

In [174]: np.rec.fromarrays([np.arange(3.), np.arange(24).reshape(3,4,2)], dtype=dt)     
Out[174]: 
rec.array([(0., [[ 0.,  1.], [ 2.,  3.], [ 4.,  5.], [ 6.,  7.]]),
           (1., [[ 8.,  9.], [10., 11.], [12., 13.], [14., 15.]]),
           (2., [[16., 17.], [18., 19.], [20., 21.], [22., 23.]])],
          dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])

This is the same, except fields can be accessed as attributes (as well). Under the covers it does the same by-field assignment.
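
A small sketch of the two access styles, reusing the 4 x 2 dtype from above (the variable names by_attr and by_key are only for illustration):

```python
import numpy as np

dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
rec = np.rec.fromarrays(
    [np.arange(3.), np.arange(24).reshape(3, 4, 2)], dtype=dt)

by_attr = rec.field1     # attribute access, available on record arrays
by_key = rec['field1']   # key access, works on any structured array

# Both styles return the same data
print(np.array_equal(by_attr, by_key))
print(rec.field2.shape)
```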

numpy.lib.recfunctions is another collection of structured array functions. These too mostly follow the by-field assignment approach.
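
Applied to the shapes in the question, the by-field approach could be wrapped in a small helper. This is only a sketch: combine is a hypothetical name, the field names are taken from the question, and a small N is assumed for illustration.

```python
import numpy as np

def combine(a, b):
    """Pack a (N,) and b (N, ...) into one structured array, field by field."""
    # Hypothetical helper: the field2 shape is taken from b's trailing dimensions
    dt = np.dtype([('field1', a.dtype), ('field2', b.dtype, b.shape[1:])])
    out = np.empty(a.shape[0], dtype=dt)  # every field is overwritten below
    out['field1'] = a
    out['field2'] = b
    return out

N = 4  # small for illustration
a = np.arange(N, dtype=np.float64)
b = np.arange(N * 256 * 2, dtype=np.float64).reshape(N, 256, 2)

rows = combine(a, b)
print(rows.shape, rows.dtype['field2'].shape)
```

rows could then be passed to table.append(rows), provided the table's description matches this dtype.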

Upvotes: 1
