Reputation: 4768
I have a 2D numpy array, and I'd like to apply a specific dtype to each column.
import numpy as np
a = np.arange(25).reshape((5,5))
In [40]: a
Out[40]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
In [41]: a.astype(dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
I was expecting line 41 to apply the dtype that I desired, but instead it "upcast" by creating a new axis, replicating the whole array once for each of the dtypes:
Out[41]:
array([[(0, 0, 0, 0.0, 0.0), (1, 1, 1, 1.0, 1.0), (2, 2, 2, 2.0, 2.0),
        (3, 3, 3, 3.0, 3.0), (4, 4, 4, 4.0, 4.0)],
       [(5, 5, 5, 5.0, 5.0), (6, 6, 6, 6.0, 6.0), (7, 7, 7, 7.0, 7.0),
        (8, 8, 8, 8.0, 8.0), (9, 9, 9, 9.0, 9.0)],
       [(10, 10, 10, 10.0, 10.0), (11, 11, 11, 11.0, 11.0),
        (12, 12, 12, 12.0, 12.0), (13, 13, 13, 13.0, 13.0),
        (14, 14, 14, 14.0, 14.0)],
       [(15, 15, 15, 15.0, 15.0), (16, 16, 16, 16.0, 16.0),
        (17, 17, 17, 17.0, 17.0), (18, 18, 18, 18.0, 18.0),
        (19, 19, 19, 19.0, 19.0)],
       [(20, 20, 20, 20.0, 20.0), (21, 21, 21, 21.0, 21.0),
        (22, 22, 22, 22.0, 22.0), (23, 23, 23, 23.0, 23.0),
        (24, 24, 24, 24.0, 24.0)]],
      dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
Why did this happen, given that the number of dtypes matches the number of columns (and so I didn't expect upcasting)?
How can I take an existing array in memory and apply per-column dtypes, as I had intended on line 41? Thanks.
Upvotes: 3
Views: 556
Reputation: 151147
This is an odd corner case that I've never encountered, but I believe the answer is related to the fact that, in general, numpy only supports a few forms of assignment to structured arrays.
In this particular case, I think numpy is following the convention used for scalar assignment to structured arrays, and is then broadcasting the assignment over the whole input array to generate a result of the same shape as the original array.
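The scalar-assignment convention mentioned above can be seen in a small sketch. The names `dt`, `s`, and `m` are only illustrative, and the behavior shown is what I would expect from a reasonably recent NumPy rather than a guarantee across all versions:

```python
import numpy as np

dt = np.dtype([('x', 'f8'), ('y', 'i8')])

# Assigning a plain scalar to a structured array writes it into every field.
s = np.zeros(2, dtype=dt)
s[:] = 7
print(s)

# astype with a structured dtype treats each element of the 2-d input as
# such a scalar, so the result keeps the (2, 2) shape of the input.
m = np.array([[1, 2], [3, 4]]).astype(dt)
print(m.shape)
print(m[0, 1])
```

Each element of `m` carries the same value in both fields, which is the replication the question describes.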
I believe the forms of assignment for structured arrays are limited because the "columns" of structured arrays are not much like the columns of ordinary 2-d arrays. In fact, it makes more sense to think of a ten-row, three-column structured array as a 1-d array of ten instances of an atomic row type.
These atomic row types are called "structured scalars". They have a fixed internal memory layout that cannot be dynamically reshaped, and so it doesn't really make sense to treat them the same way as the row of a 2-d array.
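To make the "1-d array of atomic rows" picture concrete, here is a minimal sketch of the ten-row, three-field case. The field names and types are made up for illustration:

```python
import numpy as np

# Hypothetical three-field row type; the default layout is packed (unaligned).
dt = np.dtype([('x', 'f8'), ('y', 'i8'), ('z', 'i4')])
rows = np.zeros(10, dtype=dt)

print(rows.shape)       # a single axis: the fields are not a second dimension
print(rows.itemsize)    # bytes per structured scalar: 8 + 8 + 4
print(type(rows[0]))    # one element is a structured scalar (numpy.void)
print(rows['x'].shape)  # selecting a field gives one value per row
```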
As for how to take an existing array in memory and apply per-column dtypes: honestly, I don't know! I will update this answer if I find a good way. But I don't think I will find one, because, as discussed above, structured scalars have their own distinctive memory layout. It's possible to hack up something with a buffer that has the right layout, but you'd be digging into numpy internals to do that, which isn't ideal. That being said, see this answer from Mad Physicist, who has done this somewhat more elegantly than I thought was possible.
It's also worth mentioning that astype creates a copy by default. You can pass copy=False, but numpy might still make a copy if certain requirements aren't satisfied.
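A quick way to check whether `copy=False` actually avoided a copy is `np.shares_memory`. This is only a sketch with illustrative names (`same`, `conv`):

```python
import numpy as np

a = np.arange(6, dtype='<i4')

# The dtype already matches, so astype can hand back the same data.
same = a.astype('<i4', copy=False)

# A real conversion is required here, so a new buffer is allocated anyway.
conv = a.astype('<f4', copy=False)

print(np.shares_memory(a, same))
print(np.shares_memory(a, conv))
```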
I rarely find that I actually need a view; often creating a copy causes no perceptible change in performance. My first approach to this problem would simply be to use one of the standard assignment strategies for record arrays. In this case, that would probably mean using subarray assignment. First we create the array. Note the tuples. They are required for expected behavior.
>>> a = np.array([(1, 2), (3, 4)], dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([(1., 2), (3., 4)], dtype=[('x', '<f8'), ('y', '<i8')])
Now if we try to assign an ordinary 2-d array to a, we get an error:
>>> a[:] = np.array([[11, 22], [33, 44]])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not broadcast input array from shape (2,2) into shape (2)
But we can easily assign in a column-wise way:
>>> a['x'] = [11, 22]
>>> a['y'] = [33, 44]
>>> a
array([(11., 33), (22., 44)], dtype=[('x', '<f8'), ('y', '<i8')])
We can also use Python tuples. This overwrites the whole array:
>>> a[:] = [(111, 222), (333, 444)]
>>> a
array([(111., 222), (333., 444)], dtype=[('x', '<f8'), ('y', '<i8')])
We can also assign data row-wise using tuples:
>>> a[1] = (3333, 4444)
>>> a
array([( 111., 222), (3333., 4444)], dtype=[('x', '<f8'), ('y', '<i8')])
Again, this fails if we try to pass a list or array:
>>> a[1] = [3333, 4444]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
>>> a[1] = np.array([3333, 4444])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
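Applied to the 5x5 array from the question, the column-wise strategy could look roughly like this. It makes a copy rather than a view, and the names `arr` and `out` are illustrative:

```python
import numpy as np

arr = np.arange(25).reshape((5, 5))
dtype = [('width', '<i4'), ('height', '<i4'), ('depth', '<i4'),
         ('score', '<f4'), ('auc', '<f4')]

# Allocate one structured scalar per row, then fill it field by field.
out = np.empty(arr.shape[0], dtype=dtype)
for i, name in enumerate(out.dtype.names):
    out[name] = arr[:, i]  # each column is cast to that field's type

print(out)
```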
Finally, note that we see the same behavior you saw with astype when we try to create a structured array from nested lists or numpy arrays. numpy just broadcasts the input array against the datatype, producing a 2-d array of structured scalars:
>>> a = np.array([[1, 2], [3, 4]], dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([[(1., 1), (2., 2)],
       [(3., 3), (4., 4)]], dtype=[('x', '<f8'), ('y', '<i8')])
>>> a = np.array(np.array([[1, 2], [3, 4]]), dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([[(1., 1), (2., 2)],
       [(3., 3), (4., 4)]], dtype=[('x', '<f8'), ('y', '<i8')])
If your goal is simply to create a new array, then see the answers to this question. They cover a couple of useful approaches, including numpy.core.records.fromarrays and numpy.core.records.fromrecords. See also Paul Panzer's answer, which discusses how to create a new record array (a structured array that allows attribute access to columns).
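Two more ways to build a new structured array from the 2-d input are sketched below. Neither comes from the linked answers: the first assumes a NumPy recent enough to ship `numpy.lib.recfunctions.unstructured_to_structured` (I believe 1.16 or later), and the second relies on the tuple requirement described earlier. The names `arr`, `dt`, `out`, and `via_tuples` are illustrative:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

arr = np.arange(25).reshape((5, 5))
dt = np.dtype([('width', '<i4'), ('height', '<i4'), ('depth', '<i4'),
               ('score', '<f4'), ('auc', '<f4')])

# Map the last axis of the plain array onto the fields of dt.
out = rfn.unstructured_to_structured(arr, dt)

# The same result by converting every row to a tuple first.
via_tuples = np.array([tuple(row) for row in arr], dtype=dt)

print(out)
print(out.tolist() == via_tuples.tolist())
```

Both variants copy the data, so the original array is left untouched.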
Upvotes: 1
Reputation: 53089
Here is a workaround using np.rec.fromarrays:
>>> dtype = [('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')]
>>> np.rec.fromarrays(a.T, dtype=dtype)
rec.array([( 0, 1, 2, 3., 4.), ( 5, 6, 7, 8., 9.),
(10, 11, 12, 13., 14.), (15, 16, 17, 18., 19.),
(20, 21, 22, 23., 24.)],
dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
This is a recarray, but we can cast it to ndarray if need be. In addition, the dtype is np.record, so we need to (view-)cast that to void to get a "clean" numpy result.
>>> np.asarray(np.rec.fromarrays(a.T, dtype=dtype)).view(dtype)
array([( 0, 1, 2, 3., 4.), ( 5, 6, 7, 8., 9.),
(10, 11, 12, 13., 14.), (15, 16, 17, 18., 19.),
(20, 21, 22, 23., 24.)],
dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
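The difference between the two results can be inspected directly. This sketch reuses the calls shown above; the names `arr`, `rec`, and `plain` are illustrative:

```python
import numpy as np

arr = np.arange(25).reshape((5, 5))
dtype = [('width', '<i4'), ('height', '<i4'), ('depth', '<i4'),
         ('score', '<f4'), ('auc', '<f4')]

rec = np.rec.fromarrays(arr.T, dtype=dtype)
plain = np.asarray(rec).view(dtype)

print(rec.score)         # recarrays allow attribute access to fields
print(rec.dtype.type)    # numpy.record
print(type(plain))       # numpy.ndarray
print(plain.dtype.type)  # numpy.void
```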
Upvotes: 2
Reputation: 114518
As @senderle correctly points out, you rarely need a view, but here is a possible solution that does this almost in place, just for fun. The only modification you need to make is to ensure that all of your types are the same size.
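import numpy as np
# Use 4-byte integers so every column has the same size as its target field.
a = np.arange(25, dtype='<i4').reshape((5,5))
b = a.view(dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
# The float fields still hold integer bit patterns; overwrite them with converted values.
b['score'] = a[:, -2, np.newaxis].astype('<f4')
b['auc'] = a[:, -1, np.newaxis].astype('<f4')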
If we are going to do non-recommended things, you can also insert the line b.shape = (5,) after getting the view to eliminate the extra dimension preserved from a, and make the assignments below it simpler.
This will give you a view b, which has all the desired properties, but of course will mess up the contents of a:
>>> a
array([[ 0, 1, 2, 1077936128, 1082130432],
[ 5, 6, 7, 1090519040, 1091567616],
[ 10, 11, 12, 1095761920, 1096810496],
[ 15, 16, 17, 1099956224, 1100480512],
[ 20, 21, 22, 1102577664, 1103101952]])
>>> b
array([[( 0, 1, 2, 3., 4.)],
[( 5, 6, 7, 8., 9.)],
[(10, 11, 12, 13., 14.)],
[(15, 16, 17, 18., 19.)],
[(20, 21, 22, 23., 24.)]],
dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
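The large integers that now appear in a are simply the float32 bit patterns read back as int32. A small sketch (the names `f` and `bits` are illustrative) reproduces the first row's values:

```python
import numpy as np

# 3.0 and 4.0 stored as little-endian float32 ...
f = np.array([3.0, 4.0], dtype='<f4')

# ... and the same four bytes each, reinterpreted as little-endian int32.
bits = f.view('<i4')
print(bits)  # [1077936128 1082130432]
```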
Upvotes: 2