python python-2.7 numpy structured-array recarray

Reputation: 4102

numpy: How to add a column to an existing structured array?

I have a starting array such as:

[(1, [-112.01268501699997, 40.64249414272372])
 (2, [-111.86145708699996, 40.4945008710162])]

The first column is an int and the second is a list of floats. I need to add a str column called 'USNG'.

I then create a structured numpy array, as such:

dtype = numpy.dtype([('USNG', '|S100')])
x = numpy.empty(array.shape, dtype=dtype)

I want to append the x numpy array to the existing array as a new column, so I can output some information to that column for each row.

When I do the following:

numpy.append(array, x, axis=1)

I get the following error:

'TypeError: invalid type promotion'

I've also tried vstack and hstack.

Upvotes: 18

Answers (7)

Diego Alonso

Reputation: 133

Here's a function that implements Warren's solution:

import numpy as np

def happend(x, col_data, col_name: str):
    if not x.dtype.fields:  # not a structured array
        return None
    # 0) create a new structured array whose dtype has the extra field
    y = np.empty(x.shape, dtype=x.dtype.descr + [(col_name, col_data.dtype)])
    for name in x.dtype.names:  # 1) copy the old fields
        y[name] = x[name]
    y[col_name] = col_data  # 2) copy the new column
    return y

y = happend(x, np.arange(x.shape[0]), 'idx')  # assuming `x` is a structured array

Upvotes: 2

the-citto

Reputation: 137

with 2mil+ arrays to work with, I immediately noticed a big difference between Warren Weckesser's solution and Tonsic's (thank you both very much)

with

first_array
[out]
array([(1633046400299000, 1.34707, 1.34748),
       (1633046400309000, 1.347  , 1.34748),
       (1633046400923000, 1.347  , 1.34749), ...,
       (1635551693846000, 1.36931, 1.36958),
       (1635551693954000, 1.36925, 1.36952),
       (1635551697902000, 1.3692 , 1.36947)],
      dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])

and

second_array
[out]
array([('2021-10-01T00:00:00.299000',), ('2021-10-01T00:00:00.309000',),
       ('2021-10-01T00:00:00.923000',), ...,
       ('2021-10-29T23:54:53.846000',), ('2021-10-29T23:54:53.954000',),
       ('2021-10-29T23:54:57.902000',)], dtype=[('date_time', '<M8[us]')])

I get

%timeit rfn.merge_arrays((first_array, second_array), flatten=True)
[out]
13.8 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

and

%timeit rfn.append_fields(first_array, 'date_time', second_array, dtypes='M8[us]').data
[out]
2.12 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

much better (and notice .data at the end to avoid getting mask and fill_value)
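As an alternative to taking .data afterwards, append_fields also accepts usemask=False, which should return a plain ndarray directly. A minimal sketch, using two made-up rows with the same dtypes as above (the names first, date_time and merged are illustrative, not from the timings):

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Small stand-in for first_array (illustrative values only).
first = np.array(
    [(1633046400299000, 1.34707, 1.34748),
     (1633046400309000, 1.347, 1.34748)],
    dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')],
)

# Plain (non-structured) datetime64 column derived from the microsecond timestamps.
date_time = first['timestamp'].astype('M8[us]')

# usemask=False asks for a plain ndarray instead of a MaskedArray.
merged = rfn.append_fields(first, 'date_time', date_time, usemask=False)

print(type(merged).__name__)
print(merged.dtype.names)
```

This only changes the return type; it says nothing about speed, which would need to be timed separately.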

whereas using something like

def building_new(first_array, other_array):
    new_array = np.zeros(
        first_array.size, 
        dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
    new_array[['timestamp', 'bid', 'ask']] = first_array[['timestamp', 'bid', 'ask']]
    new_array['date_time'] = other_array
    return new_array

(notice that in a structured array every row is a tuple, so size works nicely)

I get

%timeit building_new(first_array, second_array)
[out]
67.2 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

the output of all three is the same

[out]
array([(1633046400299000, 1.34707, 1.34748, '2021-10-01T00:00:00.299000'),
       (1633046400309000, 1.347  , 1.34748, '2021-10-01T00:00:00.309000'),
       (1633046400923000, 1.347  , 1.34749, '2021-10-01T00:00:00.923000'),
       ...,
       (1635551693846000, 1.36931, 1.36958, '2021-10-29T23:54:53.846000'),
       (1635551693954000, 1.36925, 1.36952, '2021-10-29T23:54:53.954000'),
       (1635551697902000, 1.3692 , 1.36947, '2021-10-29T23:54:57.902000')],
      dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])

a final thought:

when creating the new array directly instead of using the recfunctions, the second array doesn't even need to be a structured one

third_array
[out]
array(['2021-10-01T00:00:00.299000', '2021-10-01T00:00:00.309000',
       '2021-10-01T00:00:00.923000', ..., '2021-10-29T23:54:53.846000',
       '2021-10-29T23:54:53.954000', '2021-10-29T23:54:57.902000'],
      dtype='datetime64[us]')

%timeit building_new(first_array, third_array)
[out]
67 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 6

Tonsic

Reputation: 1138

Have you tried using numpy's recfunctions?

import numpy.lib.recfunctions as rfn

It has some very useful functions for structured arrays.

For your case, I think it could be accomplished with:

a = rfn.append_fields(a, 'USNG', np.empty(a.shape[0], dtype='|S100'), dtypes='|S100')

Tested here and it worked.

merge_arrays

As GMSL mentioned in the comments, it is possible to do that with rfn.merge_arrays, as below:

a = np.array([(1, [-112.01268501699997, 40.64249414272372]),
       (2, [-111.86145708699996, 40.4945008710162])], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,))])
a2 = np.full(a.shape[0], '', dtype=[('USNG', '|S100')])
a3 = rfn.merge_arrays((a, a2), flatten=True)

a3 will have the value:

array([(1, [-112.01268502,   40.64249414], b''),
       (2, [-111.86145709,   40.49450087], b'')],
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])

Upvotes: 6

Trenton McKinney

Reputation: 62403

If pandas is an option, it makes adding a column to a recarray much easier.
- Additionally, the data will be in a form that's easily analyzed
- numpy is a pandas dependency, and makes many operations easier.
- Also see How to add a column to numpy recarry as another example.

1. Read the current recarray with pandas.DataFrame or pandas.DataFrame.from_records.
2. Add the new column of data to the dataframe.
3. Export the dataframe to a recarray with pandas.DataFrame.to_records.

import pandas as pd
import numpy as np

# current recarray
data = np.rec.array(
    [(1, [-112.01268501699997, 40.64249414272372]),
     (2, [-111.86145708699996, 40.4945008710162])],
    dtype=[('i', '<i8'), ('loc', 'O')],
)

# create dataframe
df = pd.DataFrame(data)

# display(df)
   i                                       loc
0  1  [-112.01268501699997, 40.64249414272372]
1  2   [-111.86145708699996, 40.4945008710162]

# add new column
df['USNG'] = ['Note 1', 'Note 2']

# display(df)
   i                                       loc    USNG
0  1  [-112.01268501699997, 40.64249414272372]  Note 1
1  2   [-111.86145708699996, 40.4945008710162]  Note 2

# write the dataframe to recarray
data = df.to_records(index=False)

print(data)
[out]:
rec.array([(1, list([-112.01268501699997, 40.64249414272372]), 'Note 1'),
           (2, list([-111.86145708699996, 40.4945008710162]), 'Note 2')],
          dtype=[('i', '<i8'), ('loc', 'O'), ('USNG', 'O')])

Upvotes: 2

GMSL

Reputation: 425

Tonsic mentioned the recfunctions module, imported with import numpy.lib.recfunctions as rfn. In this case, a simpler function from that module that would work for you is rfn.merge_arrays() (docs).

Upvotes: 0

Mike O'Connor

Reputation: 2710

The question is precisely: "Any suggestions on why this is happening?"

Fundamentally, this is a bug; it has been an open ticket at numpy since 2012.

Upvotes: 1

Warren Weckesser

Reputation: 114811

You have to create a new dtype that contains the new field.

For example, here's a:

In [86]: a
Out[86]: 
array([(1, [-112.01268501699997, 40.64249414272372]),
       (2, [-111.86145708699996, 40.4945008710162])], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,))])

a.dtype.descr is [('i', '<i8'), ('loc', '<f8', (2,))]; i.e. a list of field types. We'll create a new dtype by adding ('USNG', 'S100') to the end of that list:

In [87]: new_dt = np.dtype(a.dtype.descr + [('USNG', 'S100')])

Now create a new structured array, b. I used zeros here, so the string fields will start out with the value ''. You could also use empty. The strings will then contain garbage, but that won't matter if you immediately assign values to them.

In [88]: b = np.zeros(a.shape, dtype=new_dt)

Copy over the existing data from a to b:

In [89]: b['i'] = a['i']

In [90]: b['loc'] = a['loc']

Here's b now:

In [91]: b
Out[91]: 
array([(1, [-112.01268501699997, 40.64249414272372], ''),
       (2, [-111.86145708699996, 40.4945008710162], '')], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])

Fill in the new field with some data:

In [93]: b['USNG'] = ['FOO', 'BAR']

In [94]: b
Out[94]: 
array([(1, [-112.01268501699997, 40.64249414272372], 'FOO'),
       (2, [-111.86145708699996, 40.4945008710162], 'BAR')], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])
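The interactive steps above can be gathered into one script. This is a sketch that assumes Python 3, where 'S' fields hold bytes, so the new values are written as bytes literals. The loop over a.dtype.names is a convenience that replaces the per-field copies; it is not part of the session above.

```python
import numpy as np

# The starting structured array from the example.
a = np.array(
    [(1, [-112.01268501699997, 40.64249414272372]),
     (2, [-111.86145708699996, 40.4945008710162])],
    dtype=[('i', '<i8'), ('loc', '<f8', (2,))],
)

# Build a dtype with the extra field appended to the existing field list.
new_dt = np.dtype(a.dtype.descr + [('USNG', 'S100')])

# zeros() leaves the new string field empty rather than holding garbage.
b = np.zeros(a.shape, dtype=new_dt)

# Copy every existing field across by name.
for name in a.dtype.names:
    b[name] = a[name]

# Fill in the new field (bytes, because the field type is 'S100').
b['USNG'] = [b'FOO', b'BAR']

print(b.dtype.names)
print(b['USNG'])
```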

Upvotes: 19
