Reputation: 4768

Numpy astype "upcasting" array rather than applying dtypes across columns

I have a 2D numpy array, and I'd like to apply a specific dtype to each column.

a = np.arange(25).reshape((5,5))

In [40]: a
Out[40]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [41]: a.astype(dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])

I was expecting line 41 to apply the dtype that I desired, but instead it "upcast" by creating a new axis, replicating the whole array once for each of the dtypes:

Out[41]: 
array([[(0, 0, 0, 0.0, 0.0), (1, 1, 1, 1.0, 1.0), (2, 2, 2, 2.0, 2.0),
        (3, 3, 3, 3.0, 3.0), (4, 4, 4, 4.0, 4.0)],
       [(5, 5, 5, 5.0, 5.0), (6, 6, 6, 6.0, 6.0), (7, 7, 7, 7.0, 7.0),
        (8, 8, 8, 8.0, 8.0), (9, 9, 9, 9.0, 9.0)],
       [(10, 10, 10, 10.0, 10.0), (11, 11, 11, 11.0, 11.0),
        (12, 12, 12, 12.0, 12.0), (13, 13, 13, 13.0, 13.0),
        (14, 14, 14, 14.0, 14.0)],
       [(15, 15, 15, 15.0, 15.0), (16, 16, 16, 16.0, 16.0),
        (17, 17, 17, 17.0, 17.0), (18, 18, 18, 18.0, 18.0),
        (19, 19, 19, 19.0, 19.0)],
       [(20, 20, 20, 20.0, 20.0), (21, 21, 21, 21.0, 21.0),
        (22, 22, 22, 22.0, 22.0), (23, 23, 23, 23.0, 23.0),
        (24, 24, 24, 24.0, 24.0)]], 
      dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])

Why did this happen, given that the number of dtypes matches the number of columns (and so I didn't expect upcasting)?

How can I take an existing array in memory and apply per-column dtypes, as I had intended on line 41? Thanks.

Upvotes: 3

Answers (3)

senderle

Reputation: 151147

This is an odd corner case that I've never encountered, but I believe the answer is related to the fact that in general, numpy only supports a few forms of assignment to structured arrays.

In this particular case, I think numpy is following the convention used for scalar assignment to structured arrays, and is then broadcasting the assignment over the whole input array to generate a result of the same shape as the original array.
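To make the scalar-assignment convention concrete, here is a minimal sketch. The field names and the variable names `s` and `t` are made up for illustration. It shows a scalar being copied into every field of each record, and then astype doing the same thing element by element on a 1-d array.

```python
import numpy as np

pair = [('x', 'f8'), ('y', 'i8')]

# Scalar assignment: the single value 7 is written into every field of every record.
s = np.zeros(3, dtype=pair)
s[:] = 7
print(s)

# astype follows the same rule per element: each input value becomes one
# record whose fields all hold that value, and the input shape is kept.
t = np.arange(3).astype(pair)
print(t.shape)
print(t)
```

Applied to a (5, 5) input, the same rule yields a (5, 5) array of records, which matches the output in the question.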

Why the limit?

I believe the forms of assignment for structured arrays are limited because the "columns" of structured arrays are not much like the columns of ordinary 2-d arrays. In fact, it makes more sense to think of a ten-row, three-column structured array as a 1-d array of ten instances of an atomic row type.

These atomic row types are called "structured scalars". They have a fixed internal memory layout that cannot be dynamically reshaped, and so it doesn't really make sense to treat them the same way as the row of a 2-d array.
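As a quick illustration of that mental model, the sketch below builds a ten-row, three-field structured array (the field names are arbitrary placeholders) and inspects its shape and element type.

```python
import numpy as np

# Ten records, each with three fields of different types.
rows = np.zeros(10, dtype=[('a', 'i4'), ('b', 'f8'), ('c', 'i2')])

print(rows.shape)           # one dimension: ten records, not (10, 3)
print(type(rows[0]))        # a single element is a structured scalar (numpy.void)
print(rows.dtype.names)     # the "columns" are fields of the record type
print(rows.dtype.itemsize)  # bytes per record: 4 + 8 + 2 with the default packed layout
```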

How to create a structured view of an existing array

Honestly, I don't know! I will update this answer if I find a good way. But I don't think I will find a good way, because as discussed above, structured scalars have their own distinctive memory layout. It's possible to hack up something with a buffer that has the right layout, but you'd be digging into numpy internals to do that, which isn't ideal. That being said, see this answer from Mad Physicist, who has done this somewhat more elegantly than I thought was possible.

It's also worth mentioning that astype creates a copy by default. You can pass copy=False, but numpy might still make a copy if certain requirements aren't satisfied.
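A small sketch of the copy behavior, using np.shares_memory to check whether the result reuses the input buffer. The variable names are only for illustration.

```python
import numpy as np

x = np.arange(5, dtype='<i4')

# Default: astype copies even when the dtype is unchanged.
copied = x.astype('<i4')
print(np.shares_memory(copied, x))

# copy=False with an unchanged dtype: no copy is needed, so the buffer is reused.
reused = x.astype('<i4', copy=False)
print(np.shares_memory(reused, x))

# copy=False with a different dtype: the values must be converted,
# so numpy still allocates a new array.
converted = x.astype('<f4', copy=False)
print(np.shares_memory(converted, x))
```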

Alternatives...

I rarely find that I actually need a view; often creating a copy causes no perceptible change in performance. My first approach to this problem would simply be to use one of the standard assignment strategies for record arrays. In this case, that would probably mean using subarray assignment. First we create the array. Note the tuples. They are required for expected behavior.

>>> a = np.array([(1, 2), (3, 4)], dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([(1., 2), (3., 4)], dtype=[('x', '<f8'), ('y', '<i8')])

Now if we try to assign an ordinary 2-d array to a, we get an error:

>>> a[:] = np.array([[11, 22], [33, 44]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not broadcast input array from shape (2,2) into shape (2)

But we can easily assign in a column-wise way:

>>> a['x'] = [11, 22]
>>> a['y'] = [33, 44]
>>> a
array([(11., 33), (22., 44)], dtype=[('x', '<f8'), ('y', '<i8')])

We can also use Python tuples. This overwrites the whole array:

>>> a[:] = [(111, 222), (333, 444)]
>>> a
array([(111., 222), (333., 444)], dtype=[('x', '<f8'), ('y', '<i8')])

We can also assign data row-wise using tuples:

>>> a[1] = (3333, 4444)
>>> a
array([( 111.,  222), (3333., 4444)], dtype=[('x', '<f8'), ('y', '<i8')])

Again, this fails if we try to pass a list or array:

>>> a[1] = [3333, 4444]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
>>> a[1] = np.array([3333, 4444])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
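If the data starts out as a list or an ordinary 2-d array, one way around this is to convert each row to a tuple first. This is a sketch under that assumption; the names `plain` and `rec` are placeholders.

```python
import numpy as np

rec = np.zeros(2, dtype=[('x', 'f8'), ('y', 'i8')])
plain = np.array([[11, 22], [33, 44]])

# A single row: wrap it in a tuple so numpy treats it as one record.
rec[1] = tuple(plain[1])

# The whole array: a list of tuples, one tuple per record.
rec[:] = [tuple(row) for row in plain]
print(rec)
```

This copies the data, which is usually fine for the reasons given above.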

Finally, note that we see the same behavior you saw with astype when we try to create a structured array from nested lists or numpy arrays. numpy just broadcasts the input array against the datatype, producing a 2-d array of structured scalars:

>>> a = np.array([[1, 2], [3, 4]], dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([[(1., 1), (2., 2)],
       [(3., 3), (4., 4)]], dtype=[('x', '<f8'), ('y', '<i8')])
>>> a = np.array(np.array([[1, 2], [3, 4]]), dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([[(1., 1), (2., 2)],
       [(3., 3), (4., 4)]], dtype=[('x', '<f8'), ('y', '<i8')])

If your goal is simply to create a new array, then see the answers to this question. They cover a couple of useful approaches, including numpy.core.records.fromarrays and numpy.core.records.fromrecords (also exposed as np.rec.fromarrays and np.rec.fromrecords). See also Paul Panzer's answer, which discusses how to create a new record array (a structured array that allows attribute access to columns).
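Another option, not covered by the answers mentioned above, is numpy.lib.recfunctions.unstructured_to_structured, which treats the last axis of an ordinary array as the fields of a record. The sketch below applies it to the array from the question. Treat it as one possible approach rather than the canonical one, and note that it may build a new array rather than a view, since the values have to be converted.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.arange(25).reshape((5, 5))
dtype = np.dtype([('width', '<i4'), ('height', '<i4'), ('depth', '<i4'),
                  ('score', '<f4'), ('auc', '<f4')])

# Each row of `a` becomes one record; column i is cast to the i-th field's type.
s = rfn.unstructured_to_structured(a, dtype)

print(s.shape)
print(s['width'])
print(s['score'])
```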

Upvotes: 1

Paul Panzer

Reputation: 53089

Here is a workaround using np.rec.fromarrays:

>>> dtype = [('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')]
>>> np.rec.fromarrays(a.T, dtype=dtype)
rec.array([( 0,  1,  2,  3.,  4.), ( 5,  6,  7,  8.,  9.),
           (10, 11, 12, 13., 14.), (15, 16, 17, 18., 19.),
           (20, 21, 22, 23., 24.)],
          dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])

This is a recarray, but we can cast it to an ndarray if need be. In addition, the dtype is np.record, so we need to (view-)cast that to void to get a "clean" numpy result.

>>> np.asarray(np.rec.fromarrays(a.T, dtype=dtype)).view(dtype)
array([( 0,  1,  2,  3.,  4.), ( 5,  6,  7,  8.,  9.),
       (10, 11, 12, 13., 14.), (15, 16, 17, 18., 19.),
       (20, 21, 22, 23., 24.)],
      dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
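To see what the two casts change, the following sketch checks the array class and the scalar type behind the dtype at each step. The intermediate names `rec`, `plain` and `clean` are introduced here for illustration only.

```python
import numpy as np

a = np.arange(25).reshape((5, 5))
dtype = [('width', '<i4'), ('height', '<i4'), ('depth', '<i4'),
         ('score', '<f4'), ('auc', '<f4')]

rec = np.rec.fromarrays(a.T, dtype=dtype)  # recarray, elements are np.record
plain = np.asarray(rec)                    # ndarray, but the dtype still wraps np.record
clean = plain.view(dtype)                  # ndarray with an ordinary structured dtype

for name, arr in [('rec', rec), ('plain', plain), ('clean', clean)]:
    print(name, type(arr).__name__, arr.dtype.type.__name__)

# Attribute access to fields is a recarray feature; plain field indexing works everywhere.
print(rec.score)
print(clean['score'])
```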

Upvotes: 2

Mad Physicist

Reputation: 114518

As @senderle correctly points out, you rarely need a view, but here is a possible solution to do this almost in-place just for fun. The only modification you will need to make is to ensure that your types are all of the same size.

a = np.arange(25, dtype='<i4').reshape((5,5))
b = a.view(dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
b['score'] = a[:, -2, np.newaxis].astype('<f4')
b['auc'] = a[:, -1, np.newaxis].astype('<f4')

If we are going to do non-recommended things, you can also insert the line b.shape = (5,) after getting the view to eliminate the extra dimension preserved from a, which makes the assignments below it simpler.

This will give you a view b, which has all the desired properties, but of course will mess up the contents of a:

>>> a
array([[         0,          1,          2, 1077936128, 1082130432],
       [         5,          6,          7, 1090519040, 1091567616],
       [        10,         11,         12, 1095761920, 1096810496],
       [        15,         16,         17, 1099956224, 1100480512],
       [        20,         21,         22, 1102577664, 1103101952]])
>>> b
array([[( 0,  1,  2,  3.,  4.)],
       [( 5,  6,  7,  8.,  9.)],
       [(10, 11, 12, 13., 14.)],
       [(15, 16, 17, 18., 19.)],
       [(20, 21, 22, 23., 24.)]],
      dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
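For completeness, here is a sketch of the flattened variant, using reshape(-1) instead of setting b.shape directly (an assumption of this sketch; on a contiguous array both should give a view). It also checks that b really shares its buffer with a.

```python
import numpy as np

dtype = [('width', '<i4'), ('height', '<i4'), ('depth', '<i4'),
         ('score', '<f4'), ('auc', '<f4')]

a = np.arange(25, dtype='<i4').reshape((5, 5))

# Convert the float columns first; astype copies, so these values are safe
# even though the underlying buffer is about to be overwritten.
scores = a[:, 3].astype('<f4')
aucs = a[:, 4].astype('<f4')

# Reinterpret each 20-byte row as one record, then drop the length-1 axis.
b = a.view(dtype).reshape(-1)
b['score'] = scores
b['auc'] = aucs

print(b.shape)
print(np.shares_memory(a, b))  # b is a view: no second copy of the data
print(b)
print(a[0])                    # the float bit patterns now show up in a
```

This only works because every field is four bytes wide, the same as the elements of a. With mixed field sizes the rows no longer line up with the records, and one of the copying approaches is the safer choice.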