Reputation: 4774
I am using NumPy's .astype() method to convert data types, but it gives a strange result. Consider the following code:
import pandas as pd
import numpy as np
import sys
df = pd.DataFrame([[0.1, 2, 'a']], columns=["a1", "a2", "str"])
arr = df.to_records(index=False)
dtype1 = [('a1', np.float32), ('a2', np.int32), ('str', '|S2')]
dtype2 = [('a2', np.int32), ('a1', np.float32), ('str', '|S2')]
arr1 = arr.astype(dtype1)
arr2 = arr.astype(dtype2)
print(arr1)
print(arr2)
print(arr)
print(sys.version)
print(np.__version__)
print(pd.__version__)
I have tested it on different Python versions, and they give different results. The newer version gives an unexpected result:
[(0.1, 2, b'a')]
[(0, 2., b'a')]
[(0.1, 2, 'a')]
3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
1.15.0
0.23.4
While the older version gives the correct result:
[(0.10000000149011612, 2, 'a') (0.10000000149011612, 2, 'b')]
[(2, 0.10000000149011612, 'a') (2, 0.10000000149011612, 'b')]
[(0.1, 2L, 'a') (0.1, 2L, 'b')]
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)]
1.11.1
0.20.3
Can someone tell me what is going on?
Upvotes: 0
Views: 242
Reputation: 231385
https://docs.scipy.org/doc/numpy/user/basics.rec.html#assignment-from-other-structured-arrays says that assignment from other structured arrays is by position, not by field name. I think that applies to astype as well. If so, it means you can't reorder fields with astype.
Accessing multiple fields at once has changed in recent releases, and may change more. Part of it is whether such access should be a copy or view.
recfunctions has code for adding, deleting or merging fields. A common strategy is to create a target array with the new dtype, and copy values to it by field name. This is iterative, but since an array typically has many more records than fields, the time penalty isn't big.
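The copy-by-field-name strategy described above could look roughly like this. It is a minimal sketch, not code from recfunctions: the helper name cast_by_field_name is made up for illustration, and it assumes every field in the new dtype also exists in the source array.

```python
import numpy as np

def cast_by_field_name(src, new_dtype):
    """Hypothetical helper: copy src into a new array, matching fields by name."""
    new_dtype = np.dtype(new_dtype)
    out = np.zeros(src.shape, dtype=new_dtype)
    for name in new_dtype.names:
        # Each field is copied (and cast) on its own, so the order of
        # the fields in new_dtype does not matter.
        out[name] = src[name]
    return out

dt1 = np.dtype([('a', float), ('b', int), ('c', 'U3')])
dt2 = np.dtype([('b', int), ('a', float), ('c', 'S3')])
arr1 = np.array([(1, 2, 'a'), (3, 4, 'b'), (5, 6, 'c')], dt1)

arr2 = cast_by_field_name(arr1, dt2)
print(arr2)
print(arr2['a'])  # should hold the values of arr1['a'], not arr1['b']
```

The loop runs once per field rather than once per record, which is why the cost stays small for arrays with many records.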
In version 1.14, I can do:
In [152]: dt1 = np.dtype([('a',float),('b',int), ('c','U3')])
In [153]: dt2 = np.dtype([('b',int),('a',float), ('c','S3')])
In [154]: arr1 = np.array([(1,2,'a'),(3,4,'b'),(5,6,'c')], dt1)
In [155]: arr1
Out[155]:
array([(1., 2, 'a'), (3., 4, 'b'), (5., 6, 'c')],
dtype=[('a', '<f8'), ('b', '<i8'), ('c', '<U3')])
Simply using astype does not reorder the fields:
In [156]: arr1.astype(dt2)
Out[156]:
array([(1, 2., b'a'), (3, 4., b'b'), (5, 6., b'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
but multifield indexing does:
In [157]: arr1[['b','a','c']]
Out[157]:
array([(2, 1., 'a'), (4, 3., 'b'), (6, 5., 'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', '<U3')])
Now the dt2 astype is right:
In [158]: arr2 = arr1[['b','a','c']].astype(dt2)
In [159]: arr2
Out[159]:
array([(2, 1., b'a'), (4, 3., b'b'), (6, 5., b'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
In [160]: arr1['a']
Out[160]: array([1., 3., 5.])
In [161]: arr2['a']
Out[161]: array([1., 3., 5.])
This is 1.14; you are using 1.15, and the docs mention differences in 1.16. So this is a moving target.
astype behaves the same as assignment to a 'blank' array:
In [162]: arr2 = np.zeros(arr1.shape, dt2)
In [163]: arr2
Out[163]:
array([(0, 0., b''), (0, 0., b''), (0, 0., b'')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
In [164]: arr2[:] = arr1
In [165]: arr2
Out[165]:
array([(1, 2., b'a'), (3, 4., b'b'), (5, 6., b'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
In [166]: arr2[:] = arr1[['b','a','c']]
In [167]: arr2
Out[167]:
array([(2, 1., b'a'), (4, 3., b'b'), (6, 5., b'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
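If you would rather not depend on how multifield indexing behaves in a given release, more recent NumPy versions also ship recfunctions.require_fields, which assigns by field name instead of by position. The sketch below assumes your installed NumPy provides that function (it is not in the 1.14 session shown above) and reuses the same dt1/dt2 definitions:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt1 = np.dtype([('a', float), ('b', int), ('c', 'U3')])
dt2 = np.dtype([('b', int), ('a', float), ('c', 'S3')])
arr1 = np.array([(1, 2, 'a'), (3, 4, 'b'), (5, 6, 'c')], dt1)

# require_fields builds a new array of dtype dt2 and fills it
# by matching field names, so 'a' stays 'a' and 'b' stays 'b'.
arr2 = rfn.require_fields(arr1, dt2)
print(arr2)
print(arr2['a'])
```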
Upvotes: 1