Sean Hammond
Sean Hammond

Reputation: 14750

How to get pandas.read_csv() to infer datetime and timedelta types from CSV file columns?

pandas.read_csv() infers the types of columns, but I can't get it to infer any datetime or timedelta type (e.g. datetime64, timedelta64) for columns whose values seem like obvious datetimes and time deltas.

Here's an example CSV file:

datetime,timedelta,integer,number,boolean,string
20111230 00:00:00,one hour,10,1.6,True,Foobar

And some code to read it with pandas:

dataframe = pandas.read_csv(path)

The types of the columns on that dataframe come out as object, object, int, float, bool, object. They're all as I would expect except the first two columns, which I want to be datetime and timedelta.

Is it possible to get pandas to automatically detect datetime and timedelta columns?

(I don't want to have to tell pandas which columns are datetimes and timedeltas or tell it the formats, I want it to try and detect them automatically like it does for into, float and bool columns.)

Upvotes: 11

Views: 18320

Answers (2)

Julius Sechang Mboli
Julius Sechang Mboli

Reputation: 60

This is how I use it for multiple columns that are in the format of datetime.

parse_dates=['Start-time', 'End-time', 'Manufacturing date',
                                'Expiry Date'], infer_datetime_format=True

The infer_datetime_format=True is good as it will ignore any column that is not in the datetime format. This makes me think that it might be good if there was a way of applying the codes to all the columns in the csv file. Especially if you have like 30 or more columns to declare the dtypes as datetime. It does not work for timedelta64, though.

Upvotes: 0

EdChum
EdChum

Reputation: 394459

One thing you can do is define your date parser using strptime, this will handle your date format, this isn't automatic though:

In [59]:

import pandas as pd
import datetime as dt

def parse_dates(x):
    return dt.datetime.strptime(x, '%Y%m%d %H:%M:%S')
# dict for word lookup, conversion
word_to_int={'zero':0,
     'one':1,
     'two':2,
     'three':3,
     'four':4,
     'five':5,
     'six':6,
     'seven':7,
     'eight':8,
     'nine':9}


def str_to_time_delta(x):
    num = 0
    if 'hour' in x.lower():
        num = x[0:x.find(' ')].lower()
    return dt.timedelta( hours = word_to_int[num])
df = pd.read_csv(r'c:\temp1.txt', parse_dates=[0],date_parser=parse_dates)
df.dtypes
Out[59]:
datetime     datetime64[ns]
timedelta            object
integer               int64
number              float64
boolean                bool
string               object
dtype: object
In [60]:

Then to convert to timedeltas use the dict and function to parse and convert to timedeltas

df['timedelta'] = df['timedelta'].map(str_to_time_delta)

In [61]:

df.dtypes
Out[61]:
datetime      datetime64[ns]
timedelta    timedelta64[ns]
integer                int64
number               float64
boolean                 bool
string                object
dtype: object
In [62]:

df
Out[62]:
             datetime  timedelta  integer  number boolean  string
0 2011-12-30 00:00:00   01:00:00       10     1.6    True  Foobar

[1 rows x 6 columns]

To answer your principal question I don't know of a way to automatically do this.

EDIT

Instead of my convoluted mapping function you can do just this:

df['timedelta'] = pd.to_timedelta(df['timedelta'])

Further edit

As noted by @Jeff you can do this instead of using strptime when reading the csv (in version 0.13.1 and above though):

df = pd.read_csv(r'c:\temp1.txt', parse_dates=[0], infer_datetime_format=True)

Upvotes: 6

Related Questions