Reputation: 4102
I have a starting array such as:
[(1, [-112.01268501699997, 40.64249414272372])
(2, [-111.86145708699996, 40.4945008710162])]
The first column is an int and the second is a list of floats. I need to add a str column called 'USNG'.
I then create a structured numpy array, as such:
dtype = numpy.dtype([('USNG', '|S100')])
x = numpy.empty(array.shape, dtype=dtype)
I want to append the x numpy array to the existing array as a new column, so I can output some information to that column for each row.
When I do the following:
numpy.append(array, x, axis=1)
I get the following error:
'TypeError: invalid type promotion'
I've also tried vstack and hstack.
Upvotes: 18
Views: 15603
Reputation: 133
Here's a function that implements Warren's solution:
import numpy as np

def happend(x, col_data, col_name: str):
    if not x.dtype.fields:
        return None  # Not a structured array
    # 0) create new structured array with the extra field
    y = np.empty(x.shape, dtype=x.dtype.descr + [(col_name, col_data.dtype)])
    # 1) copy old array, field by field
    for name in x.dtype.fields.keys():
        y[name] = x[name]
    # 2) copy new column
    y[col_name] = col_data
    return y

y = happend(x, np.arange(x.shape[0]), 'idx')  # assuming `x` is a structured array
Upvotes: 2
Reputation: 137
With 2 million+ rows to work with, I immediately noticed a big speed difference between Warren Weckesser's solution and Tonsic's (thank you both very much).
with
first_array
[out]
array([(1633046400299000, 1.34707, 1.34748),
(1633046400309000, 1.347 , 1.34748),
(1633046400923000, 1.347 , 1.34749), ...,
(1635551693846000, 1.36931, 1.36958),
(1635551693954000, 1.36925, 1.36952),
(1635551697902000, 1.3692 , 1.36947)],
dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])
and
second_array
[out]
array([('2021-10-01T00:00:00.299000',), ('2021-10-01T00:00:00.309000',),
('2021-10-01T00:00:00.923000',), ...,
('2021-10-29T23:54:53.846000',), ('2021-10-29T23:54:53.954000',),
('2021-10-29T23:54:57.902000',)], dtype=[('date_time', '<M8[us]')])
I get
%timeit rfn.merge_arrays((first_array, second_array), flatten=True)
[out]
13.8 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
%timeit rfn.append_fields(first_array, 'date_time', second_array, dtypes='M8[us]').data
[out]
2.12 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Much better (and notice the .data at the end, which avoids getting the mask and fill_value),
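As a small illustration of the .data remark, here is a sketch on a hypothetical two-row array (the names base and extra are made up for this example). By default append_fields returns a masked array. Passing usemask=False should return a plain ndarray directly, which is an alternative to taking .data afterwards.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Hypothetical two-row stand-in for first_array.
base = np.array([(1, 1.5), (2, 2.5)], dtype=[('id', '<i8'), ('bid', '<f8')])
extra = np.array([10.0, 20.0])  # plain array holding the new column

# Default (usemask=True): the result is a numpy.ma.MaskedArray.
masked = rfn.append_fields(base, 'ask', extra)

# .data exposes the underlying ndarray, without mask and fill_value.
plain_from_data = masked.data

# usemask=False asks append_fields for a plain ndarray from the start.
plain = rfn.append_fields(base, 'ask', extra, usemask=False)

print(type(masked).__name__, type(plain).__name__)
print(plain.dtype.names)
```

Whether usemask=False changes the timing on large arrays is not something measured here.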
whereas using something like
def building_new(first_array, other_array):
    new_array = np.zeros(
        first_array.size,
        dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
    new_array[['timestamp', 'bid', 'ask']] = first_array[['timestamp', 'bid', 'ask']]
    new_array['date_time'] = other_array
    return new_array
(notice that in a structured array every row is a tuple, so size works nicely)
I get
%timeit building_new(first_array, second_array)
[out]
67.2 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
the output of all three is the same
[out]
array([(1633046400299000, 1.34707, 1.34748, '2021-10-01T00:00:00.299000'),
(1633046400309000, 1.347 , 1.34748, '2021-10-01T00:00:00.309000'),
(1633046400923000, 1.347 , 1.34749, '2021-10-01T00:00:00.923000'),
...,
(1635551693846000, 1.36931, 1.36958, '2021-10-29T23:54:53.846000'),
(1635551693954000, 1.36925, 1.36952, '2021-10-29T23:54:53.954000'),
(1635551697902000, 1.3692 , 1.36947, '2021-10-29T23:54:57.902000')],
dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
A final thought:
when you build the new array yourself instead of using the recfunctions, the second array doesn't even need to be a structured one
third_array
[out]
array(['2021-10-01T00:00:00.299000', '2021-10-01T00:00:00.309000',
'2021-10-01T00:00:00.923000', ..., '2021-10-29T23:54:53.846000',
'2021-10-29T23:54:53.954000', '2021-10-29T23:54:57.902000'],
dtype='datetime64[us]')
%timeit building_new(first_array, third_array)
[out]
67 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
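The same idea can be sketched without hardcoding the dtype. This is only an illustration on hypothetical two-row miniatures of first_array and third_array. The helper name add_column is made up, and it copies field by field rather than with the multi-field assignment used above, so the timings quoted in this answer do not carry over to it.

```python
import numpy as np

# Hypothetical miniature versions of first_array and third_array.
first = np.array([(1633046400299000, 1.34707, 1.34748),
                  (1633046400309000, 1.347, 1.34748)],
                 dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])
plain_dates = np.array(['2021-10-01T00:00:00.299000',
                        '2021-10-01T00:00:00.309000'], dtype='datetime64[us]')

def add_column(base, name, column):
    """Return a copy of `base` with `column` appended as field `name`."""
    column = np.asarray(column)
    # Extend the existing field list with the new field's dtype.
    new_dtype = base.dtype.descr + [(name, column.dtype.str)]
    # base.size counts records (rows), not individual fields.
    out = np.zeros(base.size, dtype=new_dtype)
    for field in base.dtype.names:
        out[field] = base[field]
    out[name] = column  # a plain (non-structured) array is fine here
    return out

merged = add_column(first, 'date_time', plain_dates)
print(merged.dtype.names)
```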
Upvotes: 6
Reputation: 1138
Have you tried using numpy's recfunctions?
import numpy.lib.recfunctions as rfn
It has some very useful functions for structured arrays.
For your case, I think it could be accomplished with:
a = rfn.append_fields(a, 'USNG', np.empty(a.shape[0], dtype='|S100'), dtypes='|S100')
Tested here and it worked.
As GMSL mentioned in the comments, it is also possible to do that with rfn.merge_arrays, like below:
a = np.array([(1, [-112.01268501699997, 40.64249414272372]),
              (2, [-111.86145708699996, 40.4945008710162])],
             dtype=[('i', '<i8'), ('loc', '<f8', (2,))])
a2 = np.full(a.shape[0], '', dtype=[('USNG', '|S100')])
a3 = rfn.merge_arrays((a, a2), flatten=True)
a3 will have the value:
array([(1, [-112.01268502, 40.64249414], b''),
(2, [-111.86145709, 40.49450087], b'')],
dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])
Upvotes: 6
Reputation: 62403
With pandas, adding a column to a recarray is much easier.
Read the recarray with pandas.DataFrame or pandas.DataFrame.from_records.
Write the dataframe back to a recarray with pandas.DataFrame.to_records.
import pandas as pd
import numpy as np
# current recarray
data = np.rec.array([(1, list([-112.01268501699997, 40.64249414272372])), (2, list([-111.86145708699996, 40.4945008710162]))], dtype=[('i', '<i8'), ('loc', 'O')])
# create dataframe
df = pd.DataFrame(data)
# display(df)
i loc
0 1 [-112.01268501699997, 40.64249414272372]
1 2 [-111.86145708699996, 40.4945008710162]
# add new column
df['USNG'] = ['Note 1', 'Note 2']
# display(df)
i loc USNG
0 1 [-112.01268501699997, 40.64249414272372] Note 1
1 2 [-111.86145708699996, 40.4945008710162] Note 2
# write the dataframe to recarray
data = df.to_records(index=False)
print(data)
[out]:
rec.array([(1, list([-112.01268501699997, 40.64249414272372]), 'Note 1'),
(2, list([-111.86145708699996, 40.4945008710162]), 'Note 2')],
dtype=[('i', '<i8'), ('loc', 'O'), ('USNG', 'O')])
Upvotes: 2
Reputation: 425
Tonsic mentioned the recfunctions module, imported with import numpy.lib.recfunctions as rfn. In this case, a simpler recfunctions function that would work for you is rfn.merge_arrays() (docs).
Upvotes: 0
Reputation: 2710
The question is precisely: "Any suggestions on why this is happening?"
Fundamentally, this is a bug; it has been an open ticket at numpy since 2012.
Upvotes: 1
Reputation: 114811
You have to create a new dtype that contains the new field.
For example, here's a:
In [86]: a
Out[86]:
array([(1, [-112.01268501699997, 40.64249414272372]),
(2, [-111.86145708699996, 40.4945008710162])],
dtype=[('i', '<i8'), ('loc', '<f8', (2,))])
a.dtype.descr is [('i', '<i8'), ('loc', '<f8', (2,))]; i.e. a list of field types. We'll create a new dtype by adding ('USNG', 'S100') to the end of that list:
In [87]: new_dt = np.dtype(a.dtype.descr + [('USNG', 'S100')])
Now create a new structured array, b. I used zeros here, so the string fields will start out with the value ''. You could also use empty. The strings will then contain garbage, but that won't matter if you immediately assign values to them.
In [88]: b = np.zeros(a.shape, dtype=new_dt)
Copy over the existing data from a to b:
In [89]: b['i'] = a['i']
In [90]: b['loc'] = a['loc']
Here's b now:
In [91]: b
Out[91]:
array([(1, [-112.01268501699997, 40.64249414272372], ''),
(2, [-111.86145708699996, 40.4945008710162], '')],
dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])
Fill in the new field with some data:
In [93]: b['USNG'] = ['FOO', 'BAR']
In [94]: b
Out[94]:
array([(1, [-112.01268501699997, 40.64249414272372], 'FOO'),
(2, [-111.86145708699996, 40.4945008710162], 'BAR')],
dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])
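One note for Python 3 users: an 'S100' field stores bytes, so the values above display as b'FOO' and b'BAR' there. The sketch below assumes Python 3 and uses a hypothetical helper name, with_usng, to run the same steps twice: once with 'S100' and once with 'U100', NumPy's fixed-width unicode type, which gives back str values.

```python
import numpy as np

a = np.array([(1, [-112.01268501699997, 40.64249414272372]),
              (2, [-111.86145708699996, 40.4945008710162])],
             dtype=[('i', '<i8'), ('loc', '<f8', (2,))])

def with_usng(arr, str_type):
    """Copy `arr` into a new array with an extra 'USNG' field of `str_type`."""
    new_dt = np.dtype(arr.dtype.descr + [('USNG', str_type)])
    b = np.zeros(arr.shape, dtype=new_dt)
    for name in arr.dtype.names:  # copy every existing field
        b[name] = arr[name]
    return b

b_bytes = with_usng(a, 'S100')
b_bytes['USNG'] = ['FOO', 'BAR']  # ASCII str values are stored as bytes

b_text = with_usng(a, 'U100')
b_text['USNG'] = ['FOO', 'BAR']   # stored and returned as str

print(b_bytes['USNG'])
print(b_text['USNG'])
```

Which of the two to pick depends on what will consume the column. 'U100' uses four bytes per character, so it takes more memory than 'S100'.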
Upvotes: 19