Reputation: 6726
Why does Pandas coerce my numpy float32 to float64 in this piece of code:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']], dtype=np.float32)
>>> A = df.ix[:, 0:1].values
>>> df.ix[:, 0:1] = A
>>> df[0].dtype
dtype('float64')
The behavior seems so odd to me that I wonder if it is a bug. I am on Pandas version 0.17.1 (the current PyPI version), and I note that some coercion bugs have recently been addressed, see https://github.com/pydata/pandas/issues/11847 . I haven't tried this piece of code against an up-to-date GitHub master.
Is it a bug or do I misunderstand some "feature" in Pandas? If it is a feature, then how do I get around it?
(The coercion problem relates to a question I recently asked about the performance of Pandas assignments: "Assignment of Pandas DataFrame with float32 and float64 slow")
Upvotes: 7
Views: 3208
Reputation: 231385
Not an answer, but my recreation of the problem:
In [2]: df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']], dtype=np.float32)
In [3]: df.dtypes
Out[3]:
0 float32
1 float32
2 object
dtype: object
In [4]: A=df.ix[:,:1].values
In [5]: A
Out[5]:
array([[ 1.,  2.],
       [ 3.,  4.]], dtype=float32)
In [6]: df.ix[:,:1] = A
In [7]: df.dtypes
Out[7]:
0 float64
1 float64
2 object
dtype: object
In [8]: pd.__version__
Out[8]: '0.15.0'
I'm not as familiar with pandas as numpy, but I'm puzzled as to why ix[:,:1] gives me a 2 column result. In numpy that sort of indexing gives just 1 column.
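For comparison, here is a minimal sketch of the slicing difference. It assumes a recent pandas where `.ix` is no longer available, so `.loc` (label-based) and `.iloc` (position-based) stand in for it; the names `arr` and `df_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.arange(6, dtype=np.float32).reshape(2, 3)
df_demo = pd.DataFrame(arr)  # default integer column labels 0, 1, 2

# numpy slices are half-open: column 0 up to, but not including, column 1
numpy_cols = arr[:, :1].shape[1]

# label-based slices include the end label, so this selects labels 0 and 1
loc_cols = df_demo.loc[:, :1].shape[1]

# position-based slices follow the numpy convention
iloc_cols = df_demo.iloc[:, :1].shape[1]

print(numpy_cols, loc_cols, iloc_cols)
```

With integer column labels, `.ix` treated `:1` as a label slice rather than a positional one, which would probably explain the two-column result.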
If I assign a single column, the dtype does not change:
In [47]: df.ix[:,[0]]=A[:,0]
In [48]: df.dtypes
Out[48]:
0 float32
1 float32
2 object
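Building on that observation, one possible workaround is to assign the columns one at a time and then cast back explicitly as a safety net. This is only a sketch under assumptions: it uses `[]` column assignment and `to_numpy()` from newer pandas instead of `.ix`, and it builds the frame from a dict (the name `df_fix` is hypothetical) so that the float32 dtype does not depend on how the constructor handles `dtype=` with a string column.

```python
import numpy as np
import pandas as pd

# Mixed-type frame with two explicit float32 columns and one string column
df_fix = pd.DataFrame({
    0: np.array([1, 3], dtype=np.float32),
    1: np.array([2, 4], dtype=np.float32),
    2: ["a", "b"],
})
A = df_fix[[0, 1]].to_numpy()  # float32 array of shape (2, 2)

# Assign one column at a time instead of one 2-D block
for i, col in enumerate([0, 1]):
    df_fix[col] = A[:, i]

# Safety net: cast the numeric columns back in case a version still upcasts
df_fix = df_fix.astype({0: np.float32, 1: np.float32})

print(df_fix.dtypes)
```

The final `astype` call costs an extra copy, so it may be worth dropping if the per-column assignment already keeps float32 on the pandas version in use.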
The same actions without mixed datatypes do not change the dtypes:
In [100]: df1 = pd.DataFrame([[1, 2, 1.23], [3, 4, 3.32]], dtype=np.float32)
In [101]: A1=df1.ix[:,:1].values
In [102]: df1.ix[:,:1]=A1
In [103]: df1.dtypes
Out[103]:
0 float32
1 float32
2 float32
dtype: object
The key must be that with mixed values, the dataframe is, in one sense or other, a dtype=object array, whether that's true of its internal data storage, or just its numpy interface.
In [104]: df1.as_matrix()
Out[104]:
array([[ 1.        ,  2.        ,  1.23000002],
       [ 3.        ,  4.        ,  3.31999993]], dtype=float32)
In [105]: df.as_matrix()
Out[105]:
array([[1.0, 2.0, 'a'],
       [3.0, 4.0, 'b']], dtype=object)
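The same check can be sketched for newer pandas, where `as_matrix()` has been removed and `to_numpy()` plays that role. The frame is built from a dict here (an assumption, to keep the float32 columns explicit), and the names `mixed` and `all_float` are illustrative only.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({
    0: np.array([1, 3], dtype=np.float32),
    1: np.array([2, 4], dtype=np.float32),
    2: ["a", "b"],
})
all_float = mixed[[0, 1]]

# Floats and strings share no numeric dtype, so the array falls back to object
mixed_dtype = mixed.to_numpy().dtype

# With only float32 columns, the array keeps float32
float_dtype = all_float.to_numpy().dtype

print(mixed_dtype, float_dtype)
```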
Upvotes: 2
Reputation: 6302
I think it is worth posting this as a GitHub issue. The behavior is certainly inconsistent.
The code takes a different branch based on whether the DataFrame is mixed-type or not (source).
In the mixed-type case, the ndarray is converted to a Python list of float64 numbers and then converted back into a float64 ndarray, disregarding the DataFrame's dtype information (function maybe_convert_objects()).
In the non-mixed-type case the DataFrame content is updated pretty much directly (source) and the DataFrame keeps its float32 dtypes.
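The effect of the mixed-type branch can be imitated with plain numpy. This is not the pandas code path itself, only a sketch of why a round trip through a Python list loses the float32 dtype; the names `A32`, `as_list` and `round_trip` are made up for the example.

```python
import numpy as np

A32 = np.array([[1, 2], [3, 4]], dtype=np.float32)

# tolist() yields nested lists of plain Python floats (double precision)
as_list = A32.tolist()

# Rebuilding an array without an explicit dtype lets numpy infer float64
round_trip = np.array(as_list)

print(A32.dtype, type(as_list[0][0]).__name__, round_trip.dtype)
```

Once the values have passed through Python floats, nothing records that they started out as float32, so the dtype has to be reapplied explicitly.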
Upvotes: 3