Baron Yugovich

Reputation: 4307

Pandas Dataframe - faster apply?

I've got the following code:

from dateutil import parser
df.local_time = df.local_time.apply(lambda x: parser.parse(x))

It is taking a prohibitively long time. How can I make it faster?

Upvotes: 0

Views: 501

Answers (1)

jakevdp

Reputation: 86330

You should use pd.to_datetime for faster datetime conversion. For example, imagine you have this data:

In [1]: import pandas as pd
        dates = pd.date_range('2015', freq='min', periods=1000)
        dates = [d.strftime('%d %b %Y %H:%M:%S') for d in dates]
        dates[:5]
Out[1]:
['01 Jan 2015 00:00:00',
 '01 Jan 2015 00:01:00',
 '01 Jan 2015 00:02:00',
 '01 Jan 2015 00:03:00',
 '01 Jan 2015 00:04:00']

You can get datetime objects this way:

In [2]: pd.to_datetime(dates[:5])
Out[2]:
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 00:01:00',
               '2015-01-01 00:02:00', '2015-01-01 00:03:00',
               '2015-01-01 00:04:00'],
              dtype='datetime64[ns]', freq=None)

This can still be slow in some cases. When all the date strings share one known format, you can make the conversion much faster by passing the format argument (here, format='%d %b %Y %H:%M:%S'). Alternatively, pass infer_datetime_format=True so that the format is inferred only once and reused for the rest of the entries (in pandas 2.0 and later this inference is the default and the argument is deprecated). Either way the speedup grows with the size of the array, but it only works if every entry uses the same format!

For example, on these 1000 string dates I defined above:

from dateutil import parser
ser = pd.Series(dates)

%timeit ser.apply(lambda x: parser.parse(x))
10 loops, best of 3: 91.1 ms per loop

%timeit pd.to_datetime(dates)
10 loops, best of 3: 139 ms per loop

%timeit pd.to_datetime(dates, format='%d %b %Y %H:%M:%S')
100 loops, best of 3: 5.96 ms per loop

%timeit pd.to_datetime(dates, infer_datetime_format=True)
100 loops, best of 3: 6.79 ms per loop

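Specifying or inferring the datetime format makes pd.to_datetime() about 20 times faster than calling it without a format (139 ms down to roughly 6 to 7 ms), and about 13 to 15 times faster than the original apply with parser.parse (91.1 ms).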
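
Applied to the DataFrame from the question, the same idea might look like the sketch below. The df here is a hypothetical stand-in with made-up sample values, and it assumes local_time holds strings in the format used above; adjust the format string to match your real data. The second part illustrates the consistency caveat: with an explicit format, a string that does not match raises a ValueError by default.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame (sample values are made up).
df = pd.DataFrame({'local_time': ['01 Jan 2015 00:00:00',
                                  '01 Jan 2015 00:01:00',
                                  '01 Jan 2015 00:02:00']})

# Assumed format of the strings; change this to match the real column.
fmt = '%d %b %Y %H:%M:%S'

# Vectorized conversion of the whole column, replacing the row-by-row apply.
df['local_time'] = pd.to_datetime(df['local_time'], format=fmt)

# Consistency caveat: a string in a different format does not match fmt,
# so pd.to_datetime raises ValueError (the default errors='raise' behavior).
mixed = pd.Series(['01 Jan 2015 00:00:00', '2015-01-02'])
try:
    pd.to_datetime(mixed, format=fmt)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

After this runs, df['local_time'] has a datetime64 dtype, so the .dt accessor and datetime comparisons work on it directly.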

Upvotes: 4
