RolfBly
RolfBly

Reputation: 3862

How does Pandas.read_csv type casting work?

Using pandas.read_csv with parse_dates option and a custom date parser, I find Pandas has a mind of its own about the data type it's reading.

Sample csv:

"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"

The actual datecleaner is here, but what I do boils down to this:

import pandas as pd

def dateclean(date):
    return str(int(date)) # Note: we return A STRING

df = pd.read_csv(
        'my.csv', 
        parse_dates=['birth_date'],
        date_parser=dateclean, 
        engine='python'
        )

print(df.birth_date)       

Output:

0       NaN
1    1625.0
2    1533.0
Name: birth_date, dtype: float64

I get type float64, even when I specified str. Also, take out the first line in the CSV, the one with the empty birth_date, and I get type int. The workaround is easy:

return '"{}"'.format(int(date))

Is there a better way?

In data analysis, I can imagine it's useful that Pandas will say 'Hey dude, you thought you were reading strings, but in fact they're numbers'. But what's the rationale for overruling me when I tell it not to?

Upvotes: 3

Views: 1575

Answers (2)

jpp
jpp

Reputation: 164673

The problem is date_parser is designed specifically for conversion to datetime:

date_parser : function, default None
Function to use for converting a sequence of string columns to an array of datetime instances.

There is no reason you should expect this parameter to work for other types. Instead, you can use the converters parameter. Here we use toolz.compose to apply int and then str. Alternatively, you can use lambda x: str(int(x)).

from io import StringIO
import pandas as pd
from toolz import compose

mystr = StringIO('''"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"''')

df = pd.read_csv(mystr, 
                 converters={'birth_date': compose(str, int)},
                 engine='python')

print(df.birth_date)       

0     NaN
1    1625
2    1533
Name: birth_date, dtype: object

If you need to replace NaN with empty strings, you can post-process with fillna:

print(df.birth_date.fillna(''))       

0        
1    1625
2    1533
Name: birth_date, dtype: object

Upvotes: 1

Evgeny
Evgeny

Reputation: 4551

Using parse_dates / date_parser looks a bit complicated for me, unless you want to generalise your import on many date columns. I think you have more control with converters parameter, where you can fit dateclean() function. You can also experiment with dtype parameter.

The problem with original dateclean() function is that it fails on "" value, because int("") raises ValueError. Pandas seem to resort to standard import when it encounters this problem, but it will fail explicitly with converters.

Below is the code to demonstrate a fix:

import pandas as pd
from pathlib import Path

doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""

Path('my.csv').write_text(doc)

def dateclean(date):
    try: 
       return str(int(date)) 
    except ValueError:
       return '' 

df = pd.read_csv(
        'my.csv', 
        parse_dates=['birth_date'],
        date_parser=dateclean, 
        engine='python'
        )

df2 = pd.read_csv(
        'my.csv', 
        converters = {'birth_date': dateclean}
        )

print(df2.birth_date)  

Hope it helps.

Upvotes: 2

Related Questions