Reputation: 1152
My real data has some 10000+ items. I have a complicated numpy record array with a format roughly like:
a = (((1., 2., 3.), 4., 'metadata1'),
     ((1., 3., 5.), 5., 'metadata1'),
     ((1., 2., 4.), 5., 'metadata2'),
     ((1., 2., 5.), 5., 'metadata2'),
     ((1., 3., 8.), 5., 'metadata3'))
My columns are defined by dtype = [('coords', '3f4'), ('values', 'f4'), ('meta', 'S10')]. I get the set of all my possible meta values by doing set(a['meta']).
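For reference, the sample above can be constructed as a structured array roughly like this (a minimal sketch; note that under Python 3 an 'S10' field stores bytes, so the meta values come back as b'metadata1' etc.):

import numpy as np

dtype = [('coords', '3f4'), ('values', 'f4'), ('meta', 'S10')]
a = np.array([((1., 2., 3.), 4., 'metadata1'),
              ((1., 3., 5.), 5., 'metadata1'),
              ((1., 2., 4.), 5., 'metadata2'),
              ((1., 2., 5.), 5., 'metadata2'),
              ((1., 3., 8.), 5., 'metadata3')], dtype=dtype)
print(set(a['meta']))  # the distinct meta labels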
And I'd like to split it into smaller lists based on the 'meta' column. Ideally, I'd like results like:
a['metadata1'] == (((1., 2., 3.), 4.), ((1., 3., 5.), 5.))
a['metadata2'] == (((1., 2., 4.), 5.), ((1., 2., 5.), 5.))
a['metadata3'] == (((1., 3., 8.), 5.))
or
a[0] = (((1., 2., 3.), 4., 'metadata1'), ((1., 3., 5.), 5., 'metadata1'))
a[1] = (((1., 2., 4.), 5., 'metadata2'), ((1., 2., 5.), 5., 'metadata2'))
a[2] = (((1., 3., 8.), 5., 'metadata3'))
or any other conveniently split format.
For a large dataset, though, the former is better on memory. Any ideas on how to do this split? I've seen some other questions here, but they all test for numerical values.
Upvotes: 0
Views: 1158
Reputation: 7842
You can always access those rows easily using fancy indexing:
In [34]: a[a['meta']=='metadata2']
Out[34]:
rec.array([(array([ 1.,  2.,  4.], dtype=float32), 5.0, 'metadata2'),
           (array([ 1.,  2.,  5.], dtype=float32), 5.0, 'metadata2')],
          dtype=[('coords', '<f4', (3,)), ('values', '<f4'), ('meta', 'S10')])
You can use this approach to build a lookup dictionary keyed by the different meta types:
import numpy as np

meta_dict = {}
for meta_type in np.unique(a['meta']):
    # boolean mask picks out every row whose 'meta' field equals this type
    meta_dict[meta_type] = a[a['meta'] == meta_type]
This will be very inefficient if there is a large number of meta types, since the whole array is scanned once per type.
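If that becomes a problem, one alternative (a sketch, using only numpy) is to sort the array by 'meta' once and then cut it at the boundaries between labels:

import numpy as np

# sort the rows by the 'meta' field, then split at the first occurrence of each label
a_sorted = a[np.argsort(a['meta'])]
labels, first = np.unique(a_sorted['meta'], return_index=True)
groups = np.split(a_sorted, first[1:])   # one sub-array per meta label
meta_dict = dict(zip(labels, groups))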
A more efficient solution might be to look into using a pandas DataFrame. DataFrames have a groupby functionality that performs exactly the task you describe.
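A rough sketch of what that could look like (assuming pandas is installed; the 3-element 'coords' field can't be put into a DataFrame column directly, so it's passed as a list of vectors here):

import pandas as pd

df = pd.DataFrame({'coords': list(a['coords']),
                   'values': a['values'],
                   'meta': a['meta']})
for meta_type, group in df.groupby('meta'):
    print(meta_type, len(group))   # each group holds the rows for one meta label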
Upvotes: 2