Reputation: 139172
I am trying to read a CSV file. Here is a sample:
datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:01:00,2,0,0,2.407,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:02:50,2,0,0,2.207,-17.354,0,8192
2012-10-27 15:02:51,1,0,0,2.207,-17.358,0,-8152
2012-10-27 15:02:52,1,0,0,2.207,-17.358,0,648
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-10-27 15:06:04,0,51.195182,4.44427,2.344,-17.289,0,587
2012-12-05 09:21:34,,,,,42.960,1,16430
2012-12-05 09:21:35,,,,,42.962,1,3597
The problem I encounter is that in columns containing only ints, the 0s are converted to NaN (e.g. the columns 'check' and 'status'; these contain only ints, but they are read as floats because they also have genuinely missing values). I only want the empty values to be converted to NaN, not the zeros.
This is what I get:
>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
check lat lon co_alpha atn status bc
datetime
2012-10-27 15:00:59 2 0.000000 0.000000 2.427 NaN NaN NaN
2012-10-27 15:01:00 2 0.000000 0.000000 2.407 NaN NaN NaN
2012-10-27 15:02:49 2 0.000000 0.000000 2.207 -17.358 NaN -16162
2012-10-27 15:02:50 2 0.000000 0.000000 2.207 -17.354 NaN 8192
2012-10-27 15:02:51 1 0.000000 0.000000 2.207 -17.358 NaN -8152
2012-10-27 15:02:52 1 0.000000 0.000000 2.207 -17.358 NaN 648
2012-10-27 15:06:03 NaN 51.195076 4.444407 2.349 -17.289 NaN 4909
2012-10-27 15:06:04 NaN 51.195182 4.444270 2.344 -17.289 NaN 587
2012-12-05 09:21:34 NaN NaN NaN NaN 42.960 1 16430
2012-12-05 09:21:35 NaN NaN NaN NaN 42.962 1 3597
So the columns 'check' and 'status' contain too many NaNs. In the 'lat' and 'lon' columns, the 0s are not converted to NaN.
Using na_values='' and keep_default_na=False does not help. Is there a way to tell pandas not to convert int 0s to NaN? Or is this a bug?
I could specify the dtype of those columns as int with the dtype keyword. This keeps the 0s as 0s, but those columns also contain real NaNs (empty values), and these would then be converted to 0s as well, because an int column cannot hold NaN. For this reason, I have to keep all columns as floats.
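For readers on a much newer pandas than the one discussed here, the int-versus-NaN limitation can be sidestepped with the nullable "Int64" extension dtype. This is only a sketch under assumptions: it needs a recent pandas release (the dtype did not exist in 0.10.x), it uses read_csv with an in-memory string instead of the clipboard, and the three-row sample is a shortened stand-in for the data above.

```python
import io

import pandas as pd

# Shortened stand-in for the sample data (assumption: only two int columns kept).
csv_text = """datetime,check,status
2012-10-27 15:02:49,2,0
2012-10-27 15:06:03,0,0
2012-12-05 09:21:34,,1
"""

# "Int64" (capital I) is the nullable integer dtype: it can hold both
# integer zeros and missing values, so no fallback to float is needed.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=True,
    dtype={"check": "Int64", "status": "Int64"},
)

print(df.dtypes)
print(df)
```

With this dtype the empty field is shown as a missing value while the zeros remain integer zeros.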
EDIT: after upgrading to pandas 0.10.1, it works as expected even without specifying keep_default_na and na_values:
>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
check lat lon co_alpha atn status bc
datetime
2012-10-27 15:00:59 2 0.000000 0.000000 2.427 NaN NaN NaN
2012-10-27 15:01:00 2 0.000000 0.000000 2.407 NaN NaN NaN
2012-10-27 15:02:49 2 0.000000 0.000000 2.207 -17.358 0 -16162
2012-10-27 15:02:50 2 0.000000 0.000000 2.207 -17.354 0 8192
2012-10-27 15:02:51 1 0.000000 0.000000 2.207 -17.358 0 -8152
2012-10-27 15:02:52 1 0.000000 0.000000 2.207 -17.358 0 648
2012-10-27 15:06:03 0 51.195076 4.444407 2.349 -17.289 0 4909
2012-10-27 15:06:04 0 51.195182 4.444270 2.344 -17.289 0 587
2012-12-05 09:21:34 NaN NaN NaN NaN 42.960 1 16430
2012-12-05 09:21:35 NaN NaN NaN NaN 42.962 1 3597
Upvotes: 1
Views: 4656
Reputation: 375535
You have to first set keep_default_na to False:
df = pd.read_clipboard(sep=',', index_col=0, keep_default_na=False, na_values='')
In [2]: df
Out[2]:
check lat lon co_alpha atn status bc
datetime
2012-10-27 15:00:59 2 0.000000 0.000000 2.427 NaN NaN NaN
2012-10-27 15:01:00 2 0.000000 0.000000 2.407 NaN NaN NaN
2012-10-27 15:02:49 2 0.000000 0.000000 2.207 -17.358 0 -16162
2012-10-27 15:02:50 2 0.000000 0.000000 2.207 -17.354 0 8192
2012-10-27 15:02:51 1 0.000000 0.000000 2.207 -17.358 0 -8152
2012-10-27 15:02:52 1 0.000000 0.000000 2.207 -17.358 0 648
2012-10-27 15:06:03 0 51.195076 4.444407 2.349 -17.289 0 4909
2012-10-27 15:06:04 0 51.195182 4.444270 2.344 -17.289 0 587
2012-12-05 09:21:34 NaN NaN NaN NaN 42.960 1 16430
2012-12-05 09:21:35 NaN NaN NaN NaN 42.962 1 3597
From the docstring of read_table:
keep_default_na : bool, default True
    If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to
na_values : list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
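To try the two parameters without relying on the clipboard, something like the following should work. It is a minimal sketch, assuming read_csv accepts the same keep_default_na and na_values arguments as read_clipboard, and it uses a shortened three-row copy of the sample data held in a string.

```python
import io

import pandas as pd

# Three representative rows from the question's sample (shortened for brevity).
csv_text = """datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-12-05 09:21:34,,,,,42.960,1,16430
"""

# keep_default_na=False drops the built-in list of NA markers;
# na_values=[""] then makes the empty string the only NA marker.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=True,
    keep_default_na=False,
    na_values=[""],
)

print(df)
print(df.isna().sum())
```

Only the empty fields should come back as NaN here. The 'check' column still ends up as float because it contains a missing value, but its 0 stays 0.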
Upvotes: 5