argentum2f

Reputation: 5350

Pandas Float64 vs float64 dtypes (note capitalization) causing non-numeric errors?

I was getting some weird errors that after much searching appeared to (maybe) come from my data not being considered numeric in some cases. This seems to be because I used Float64 dtype (which I thought was what I was supposed to do).

TL;DR: What's the difference between Float64 and float64? Why does using Float64 data break a lot of things, such as DataFrame.interpolate? What is even the purpose of Float64 existing?

Example:

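from io import StringIO

import pandas as pd
import numpy as np

# Small CSV sample with one missing value (the NaN in val2)
TESTDATA = u"""\
val1, val2, val3
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0
10.0, NaN, 12.0
13.0, 14.0, 15.0
"""

# Read every column as the nullable Float64 extension dtype
df = pd.read_csv(StringIO(TESTDATA), sep=r",\s*", engine='python', dtype=pd.Float64Dtype())

print(df)
print()
print(df.dtypes)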

This outputs:

   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0  <NA>  12.0
4  13.0  14.0  15.0

val1    Float64
val2    Float64
val3    Float64
dtype: object

So far everything looks good (as expected), but now I try:

df.interpolate()

and get:

ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

This was rather baffling to me until I came across other answers and realized that this error might be coming about because interpolate thought the data was non-numeric and was therefore limiting the valid fill methods to ffill/bfill.

So I found that the following works:

df = df.astype(np.float64).interpolate()                                             
print(df.dtypes)                                                                
print()                                                                         
print(df)

with output:

val1    float64
val2    float64
val3    float64
dtype: object

   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0  11.0  12.0
4  13.0  14.0  15.0

Note that giving it np.float64 or just float gives the same result.

Running pd.to_numeric(df.val1) on the Float64 dataframe returned a series that still has Float64 type, indicating that pandas does seem to recognize that Float64 is numeric.

Upvotes: 18

Views: 12389

Answers (2)

mirekphd

Reputation: 6763

If you don't see the point of the extension dtype (the conversion loses no data), you can manually convert the column to a standard numpy type by passing the column values through an array and changing its type, here to numpy.float64 (assigning the result back to the column keeps the frame's existing index):

df[col_name] = df[col_name].values.astype(float)
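
To apply the same idea to every Float64 column at once, something like the following should work. This is only a sketch: the sample frame and the names float_cols, plain, filled and restored are made up for illustration, and converting back to Float64 at the end is optional.

```python
import pandas as pd

# Hypothetical sample frame using the nullable Float64 extension dtype
df = pd.DataFrame({
    "val1": pd.array([1.0, 4.0, 7.0, 10.0, 13.0], dtype="Float64"),
    "val2": pd.array([2.0, 5.0, 8.0, pd.NA, 14.0], dtype="Float64"),
})

# Pick out the columns that use the Float64 extension dtype
float_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.Float64Dtype)]

# Convert them to plain numpy float64 (pd.NA becomes np.nan), then interpolate
plain = df.astype({c: "float64" for c in float_cols})
filled = plain.interpolate()

# Optionally go back to the nullable dtype afterwards
restored = filled.astype({c: "Float64" for c in float_cols})

print(filled)
print(restored.dtypes)
```

The missing entry in val2 sits between 8.0 and 14.0, so linear interpolation fills it with 11.0, matching the output in the question.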

Upvotes: 6

hpaulj

Reputation: 231335

In [52]: pd.Float64Dtype?
Init signature: pd.Float64Dtype()
Docstring:     
An ExtensionDtype for float64 data.

This dtype uses ``pd.NA`` as missing value indicator.

With a float dtype, the frame displays as

In [68]: df
Out[68]: 
   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0   NaN  12.0
4  13.0  14.0  15.0

where the NaN is np.nan, a valid float.

With the Float64 extension dtype, the same frame displays as

In [71]: df
Out[71]: 
   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0  <NA>  12.0
4  13.0  14.0  15.0

where that <NA> is pandas._libs.missing.NAType

Your df.interpolate() error indicates that the extension dtype has not been implemented for all operations. Some sources suggest it is still experimental.
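
To make the difference between the two missing-value markers concrete, here is a small sketch. The series names nullable and plain are made up for illustration; the calls used are Series.to_numpy and pd.api.types.is_numeric_dtype.

```python
import numpy as np
import pandas as pd

# Same values, two dtypes: nullable extension dtype vs. plain numpy float
nullable = pd.Series([1.0, pd.NA, 3.0], dtype="Float64")
plain = pd.Series([1.0, np.nan, 3.0], dtype="float64")

# The missing entries are different objects
print(nullable[1] is pd.NA)   # True
print(np.isnan(plain[1]))     # True

# Both dtypes are recognized as numeric
print(pd.api.types.is_numeric_dtype(nullable.dtype))  # True
print(pd.api.types.is_numeric_dtype(plain.dtype))     # True

# Comparisons propagate pd.NA for Float64, but give False for np.nan
print(nullable > 2)  # nullable "boolean" dtype, middle entry is <NA>
print(plain > 2)     # numpy bool dtype, middle entry is False

# Explicit conversion to a numpy array, mapping pd.NA to np.nan
as_numpy = nullable.to_numpy(dtype="float64", na_value=np.nan)
print(as_numpy.dtype)
```

So Float64 is treated as numeric, but its missing values behave differently from np.nan, and code written with np.nan in mind may not handle them.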

Upvotes: 5
