How does Pandas.read_csv type casting work?

Question

Using pandas.read_csv with parse_dates option and a custom date parser, I find Pandas has a mind of its own about the data type it's reading.

Sample csv:

"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"

The actual datecleaner is here, but what I do boils down to this:

import pandas as pd

def dateclean(date):
    return str(int(date)) # Note: we return A STRING

df = pd.read_csv(
        'my.csv', 
        parse_dates=['birth_date'],
        date_parser=dateclean, 
        engine='python'
        )

print(df.birth_date)

Output:

0       NaN
1    1625.0
2    1533.0
Name: birth_date, dtype: float64

I get type float64, even when I specified str. Also, take out the first line in the CSV, the one with the empty birth_date, and I get type int. The workaround is easy:

return '"{}"'.format(int(date))

Is there a better way?

In data analysis, I can imagine it's useful that Pandas will say 'Hey dude, you thought you were reading strings, but in fact they're numbers'. But what's the rationale for overruling me when I tell it not to?

Evgeny · Accepted Answer

Using parse_dates / date_parser looks a bit complicated for me, unless you want to generalise your import on many date columns. I think you have more control with converters parameter, where you can fit dateclean() function. You can also experiment with dtype parameter.

The problem with original dateclean() function is that it fails on "" value, because int("") raises ValueError. Pandas seem to resort to standard import when it encounters this problem, but it will fail explicitly with converters.

Below is the code to demonstrate a fix:

import pandas as pd
from pathlib import Path

doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""

Path('my.csv').write_text(doc)

def dateclean(date):
    try: 
       return str(int(date)) 
    except ValueError:
       return '' 

df = pd.read_csv(
        'my.csv', 
        parse_dates=['birth_date'],
        date_parser=dateclean, 
        engine='python'
        )

df2 = pd.read_csv(
        'my.csv', 
        converters = {'birth_date': dateclean}
        )

print(df2.birth_date)

Hope it helps.

How does Pandas.read_csv type casting work?

Answers (2)

Related Questions