Reputation: 3862
Using pandas.read_csv with the parse_dates option and a custom date parser, I find Pandas has a mind of its own about the data type it's reading.
Sample csv:
"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
The actual datecleaner is here, but what I do boils down to this:
import pandas as pd

def dateclean(date):
    return str(int(date))  # Note: we return A STRING

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)

print(df.birth_date)
Output:
0       NaN
1    1625.0
2    1533.0
Name: birth_date, dtype: float64
I get type float64, even when I specified str. Also, take out the first line in the CSV, the one with the empty birth_date, and I get type int. The workaround is easy:
return '"{}"'.format(int(date))
Is there a better way?
In data analysis, I can imagine it's useful that Pandas will say 'Hey dude, you thought you were reading strings, but in fact they're numbers'. But what's the rationale for overruling me when I tell it not to?
Upvotes: 3
Views: 1575
Reputation: 164673
The problem is that date_parser is designed specifically for conversion to datetime:
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of datetime instances.
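For contrast, here is a minimal sketch of date_parser doing its documented job, i.e. producing datetimes. The sample data is hypothetical; the years are kept inside the range a nanosecond-resolution pd.Timestamp can hold, which 1533 and 1625 are not.

from io import StringIO
import pandas as pd

# Toy sample with years inside the datetime64[ns] range.
sample = StringIO('"birth_date","name"\n"1980","Alice"\n"1995","Bob"')

dates = pd.read_csv(
    sample,
    parse_dates=['birth_date'],
    date_parser=lambda x: pd.to_datetime(x, format='%Y'),
    engine='python'
)
print(dates.birth_date.dtype)  # datetime64[ns]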
There is no reason you should expect this parameter to work for other types. Instead, you can use the converters parameter. Here we use toolz.compose to apply int and then str. Alternatively, you can use lambda x: str(int(x)).
from io import StringIO
import pandas as pd
from toolz import compose

mystr = StringIO('''"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"''')

df = pd.read_csv(mystr,
                 converters={'birth_date': compose(str, int)},
                 engine='python')

print(df.birth_date)
0     NaN
1    1625
2    1533
Name: birth_date, dtype: object
If you need to replace NaN with empty strings, you can post-process with fillna:
print(df.birth_date.fillna(''))
0
1    1625
2    1533
Name: birth_date, dtype: object
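For reference, a minimal sketch of the lambda alternative mentioned above; under the python engine it should give the same object-dtype result as the compose version:

from io import StringIO
import pandas as pd

# Same data, converting with a plain lambda instead of toolz.compose.
mystr = StringIO('''"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"''')

df = pd.read_csv(mystr,
                 converters={'birth_date': lambda x: str(int(x))},
                 engine='python')

print(df.birth_date)  # dtype: object, NaN for the empty cell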
Upvotes: 1
Reputation: 4551
Using parse_dates / date_parser looks a bit complicated to me, unless you want to generalise your import over many date columns. I think you have more control with the converters parameter, where you can plug in your dateclean() function. You can also experiment with the dtype parameter (see the sketch after the demo code below).
The problem with the original dateclean() function is that it fails on the "" value, because int("") raises ValueError. Pandas seems to resort to its standard import when it encounters this problem, but it will fail explicitly with converters.
Below is the code to demonstrate a fix:
import pandas as pd
from pathlib import Path

doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""
Path('my.csv').write_text(doc)

def dateclean(date):
    try:
        return str(int(date))
    except ValueError:
        return ''

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)

df2 = pd.read_csv(
    'my.csv',
    converters={'birth_date': dateclean}
)

print(df2.birth_date)
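And if you want to avoid a converter entirely, here is a rough sketch of the dtype route mentioned above, reusing the my.csv written by the demo. Note that empty cells still come back as NaN rather than '', so fill them afterwards if needed.

import pandas as pd

# Sketch: force birth_date to be read as text via dtype.
# Empty cells are still parsed as NaN, so fill them if you need ''.
df3 = pd.read_csv('my.csv', dtype={'birth_date': str})
print(df3.birth_date.fillna(''))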
Hope it helps.
Upvotes: 2