Finn Årup Nielsen

Reputation: 6726

Why does Pandas coerce my numpy float32 to float64?

Why does Pandas coerce my numpy float32 to float64 in this piece of code:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']], dtype=np.float32)
>>> A = df.ix[:, 0:1].values
>>> df.ix[:, 0:1] = A
>>> df[0].dtype
dtype('float64')

The behavior seems so odd to me that I wonder whether it is a bug. I am on Pandas version 0.17.1 (the current PyPI version), and I note that coercion bugs have been addressed recently; see https://github.com/pydata/pandas/issues/11847 . I haven't tried this code against the current GitHub master.

Is it a bug or do I misunderstand some "feature" in Pandas? If it is a feature, then how do I get around it?

(The coercion problem relates to a question I recently asked about the performance of Pandas assignments: "Assignment of Pandas DataFrame with float32 and float64 slow".)

Upvotes: 7

Views: 3208

Answers (2)

hpaulj

Reputation: 231385

Not an answer, but my recreation of the problem:

In [2]: df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']], dtype=np.float32)
In [3]: df.dtypes
Out[3]: 
0    float32
1    float32
2     object
dtype: object
In [4]: A=df.ix[:,:1].values
In [5]: A
Out[5]: 
array([[ 1.,  2.],
       [ 3.,  4.]], dtype=float32)
In [6]: df.ix[:,:1] = A
In [7]: df.dtypes
Out[7]: 
0    float64
1    float64
2     object
dtype: object
In [8]: pd.__version__
Out[8]: '0.15.0'

I'm not as familiar with pandas as I am with numpy, but I'm puzzled as to why ix[:,:1] gives me a 2-column result. In numpy that sort of indexing gives just 1 column.
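A likely explanation for the 2-column result, offered as a sketch rather than a statement about the internals of ix: with integer column labels the slice is treated as label-based, and label slices in pandas include the end label. Since .ix has been removed from newer pandas releases, the example below uses .loc (label-based) and .iloc (positional) instead. Building the frame from a dict of arrays is an assumption made here so that the example runs on current pandas versions.

```python
import numpy as np
import pandas as pd

# Columns are labelled 0, 1, 2 (integer labels), mirroring the frame in the question.
df = pd.DataFrame({
    0: np.array([1, 3], dtype=np.float32),
    1: np.array([2, 4], dtype=np.float32),
    2: ["a", "b"],
})

by_label = df.loc[:, :1]      # label slice: the end label 1 is included
by_position = df.iloc[:, :1]  # positional slice: end position 1 is excluded, as in numpy

print(by_label.shape)
print(by_position.shape)
```

The label-based slice selects columns 0 and 1, while the positional slice selects only column 0, which matches the numpy behavior.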

If I assign a single column, the dtype does not change:

In [47]: df.ix[:,[0]]=A[:,0]
In [48]: df.dtypes
Out[48]: 
0    float32
1    float32
2     object
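If the upcast is unwanted, one possible workaround is to record the dtypes of the target columns before the assignment and cast back afterwards with astype. This is only a sketch: the names cols and saved are illustrative, .loc stands in for the removed .ix, and whether the block assignment upcasts at all may depend on the pandas version. The final cast gives the same result either way.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    0: np.array([1, 3], dtype=np.float32),
    1: np.array([2, 4], dtype=np.float32),
    2: ["a", "b"],
})

cols = [0, 1]
saved = df.dtypes.loc[cols].to_dict()  # remember the dtypes of the columns being written

A = df[cols].to_numpy()                # float32 block taken from the frame
df.loc[:, cols] = A                    # block assignment; may or may not upcast
df = df.astype(saved)                  # cast the written columns back to their saved dtypes

print(df.dtypes)
```

The cast costs an extra copy of the affected columns, so it treats the symptom rather than the cause.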

The same actions without mixed datatypes do not change the dtypes:

In [100]: df1 = pd.DataFrame([[1, 2, 1.23], [3, 4, 3.32]], dtype=np.float32)
In [101]: A1=df1.ix[:,:1].values
In [102]: df1.ix[:,:1]=A1
In [103]: df1.dtypes
Out[103]: 
0    float32
1    float32
2    float32
dtype: object

The key must be that with mixed values, the dataframe is, in one sense or another, a dtype=object array, whether that's true of its internal data storage or just of its numpy interface.

In [104]: df1.as_matrix()
Out[104]: 
array([[ 1.        ,  2.        ,  1.23000002],
       [ 3.        ,  4.        ,  3.31999993]], dtype=float32)
In [105]: df.as_matrix()
Out[105]: 
array([[1.0, 2.0, 'a'],
       [3.0, 4.0, 'b']], dtype=object)
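The same comparison can be sketched on newer pandas versions, where as_matrix() is no longer available, by using to_numpy(). The frames below are rebuilt from arrays (an assumption, so that the example runs on current releases). The example suggests that the object dtype belongs to the combined numpy view, while each column still reports its own dtype.

```python
import numpy as np
import pandas as pd

homogeneous = pd.DataFrame(
    np.array([[1, 2, 1.23], [3, 4, 3.32]], dtype=np.float32)
)
mixed = pd.DataFrame({
    0: np.array([1, 3], dtype=np.float32),
    1: np.array([2, 4], dtype=np.float32),
    2: pd.Series(["a", "b"], dtype=object),
})

print(homogeneous.to_numpy().dtype)  # single dtype shared by all columns
print(mixed.to_numpy().dtype)        # common dtype needed to hold floats and strings
print(mixed.dtypes)                  # per-column dtypes are kept separately
```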

Upvotes: 2

Martin Valgur

Reputation: 6302

I think it is worth posting this as a GitHub issue. The behavior is certainly inconsistent.

The code takes a different branch based on whether the DataFrame is mixed-type or not (source).

  • In the mixed-type case the ndarray is converted to a Python list of float64 numbers and then converted back into a float64 ndarray, disregarding the DataFrame's dtype information (function maybe_convert_objects()).

  • In the non-mixed-type case the DataFrame content is updated pretty much directly (source) and the DataFrame keeps its float32 dtypes.
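The effect described for the mixed-type branch can be illustrated with plain numpy. This sketch does not call maybe_convert_objects() or any other pandas internals; it only shows how a round trip through a Python list loses the float32 dtype.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]], dtype=np.float32)

as_list = A.tolist()         # nested list of plain Python floats (double precision)
rebuilt = np.array(as_list)  # numpy infers float64 when given Python floats

print(type(as_list[0][0]))
print(rebuilt.dtype)
```

Passing dtype=A.dtype when rebuilding the array would preserve float32, which is the information the mixed-type branch appears to discard.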

Upvotes: 3
