joris
joris

Reputation: 139172

Pandas read_csv(): keep 0 as 0 (not convert it to NaN)

I am trying to read a csv file, of which a sample:

datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:01:00,2,0,0,2.407,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:02:50,2,0,0,2.207,-17.354,0,8192
2012-10-27 15:02:51,1,0,0,2.207,-17.358,0,-8152
2012-10-27 15:02:52,1,0,0,2.207,-17.358,0,648
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-10-27 15:06:04,0,51.195182,4.44427,2.344,-17.289,0,587
2012-12-05 09:21:34,,,,,42.960,1,16430
2012-12-05 09:21:35,,,,,42.962,1,3597

The problem I encounter is that in columns with only ints, the 0's are converted to NaN (eg columns 'check' and 'status', these are columns with only ints, but the column is read as floats because there are real missing values). But I only want the empty values to be converted to NaN, and not the zeros.

This is what I get:

>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358     NaN -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354     NaN   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358     NaN  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358     NaN    648
2012-10-27 15:06:03    NaN  51.195076  4.444407     2.349 -17.289     NaN   4909
2012-10-27 15:06:04    NaN  51.195182  4.444270     2.344 -17.289     NaN    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597

So, in the columns 'check' and 'status', there are to many NaN's. In the 'lat' and 'lon' columns the 0's are not converted to NaN's.


EDIT: after upgrading to pandas 0.10.1, it works as expected even without specifying keep_default_na and na_values:

>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358       0 -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354       0   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358       0  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358       0    648
2012-10-27 15:06:03      0  51.195076  4.444407     2.349 -17.289       0   4909
2012-10-27 15:06:04      0  51.195182  4.444270     2.344 -17.289       0    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597

Upvotes: 1

Views: 4656

Answers (1)

Andy Hayden
Andy Hayden

Reputation: 375535

You have to first set keep_default_na to False:

df = pd.read_clipboard(sep=',', index_col=0, keep_default_na=False, na_values='')

In [2]: df
Out[2]: 
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358       0 -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354       0   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358       0  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358       0    648
2012-10-27 15:06:03      0  51.195076  4.444407     2.349 -17.289       0   4909
2012-10-27 15:06:04      0  51.195182  4.444270     2.344 -17.289       0    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597

From the doc-string of read_tables:

keep_default_na : bool, default True
     If na_values are specified and keep_default_na is False the default NaN
    values are overridden, otherwise they're appended to

na_values : list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

Upvotes: 5

Related Questions