Reputation: 4307
I've got the following code:
from dateutil import parser
df.local_time = df.local_time.apply(lambda x: parser.parse(x))
It seems to be taking a prohibitively long time. How can I make it faster?
Upvotes: 0
Views: 501
Reputation: 86330
You should use pd.to_datetime for faster datetime conversion. For example, imagine you have this data:
In [1]: import pandas as pd
dates = pd.date_range('2015', freq='min', periods=1000)
dates = [d.strftime('%d %b %Y %H:%M:%S') for d in dates]
dates[:5]
Out[1]:
['01 Jan 2015 00:00:00',
'01 Jan 2015 00:01:00',
'01 Jan 2015 00:02:00',
'01 Jan 2015 00:03:00',
'01 Jan 2015 00:04:00']
You can get datetime objects this way:
In [2]: pd.to_datetime(dates[:5])
Out[2]:
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 00:01:00',
'2015-01-01 00:02:00', '2015-01-01 00:03:00',
'2015-01-01 00:04:00'],
dtype='datetime64[ns]', freq=None)
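Note that pd.to_datetime also works directly on a Series (which is what a DataFrame column is) and then returns a datetime64[ns] Series rather than a DatetimeIndex:

# converting a Series gives back a Series with dtype datetime64[ns]
pd.to_datetime(pd.Series(dates[:5]))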
But this can still be slow in some cases. To convert date strings really quickly when you know that all of them share the same format, specify the format argument (here, format='%d %b %Y %H:%M:%S'), or, more automatically, pass infer_datetime_format=True so that the format is inferred once from the first entry and reused for the rest. This gives a large speedup as the size of the array grows (but it only works if the format is consistent across all entries!).
For example, on these 1000 string dates I defined above:
from dateutil import parser
ser = pd.Series(dates)
%timeit ser.apply(lambda x: parser.parse(x))
10 loops, best of 3: 91.1 ms per loop
%timeit pd.to_datetime(dates)
10 loops, best of 3: 139 ms per loop
%timeit pd.to_datetime(dates, format='%d %b %Y %H:%M:%S')
100 loops, best of 3: 5.96 ms per loop
%timeit pd.to_datetime(dates, infer_datetime_format=True)
100 loops, best of 3: 6.79 ms per loop
We get about a factor of 20 speedup by specifying or inferring the datetime format in pd.to_datetime().
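Applied back to the DataFrame in your question, a minimal sketch would look like this (the format string below is an assumption; replace it with whatever your local_time strings actually look like):

# fastest: tell pandas the exact format of your strings
# (the format here is an assumption; adjust it to match your data)
df['local_time'] = pd.to_datetime(df['local_time'], format='%d %b %Y %H:%M:%S')

# or let pandas infer the format from the first entry
df['local_time'] = pd.to_datetime(df['local_time'], infer_datetime_format=True)

In recent pandas releases infer_datetime_format is deprecated because format inference happens by default, so passing an explicit format is the most portable option.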
Upvotes: 4