Reputation: 5605
I have 2 different date columns in my dataframe that I want to convert to datetime64.
One has the format
0 2009-03-09
and is converted rather fast:
%timeit pd.to_datetime(df.acquisition_date)
10000 loops, best of 3: 97.9 µs per loop
The other one has this format:
0 2013-01-07 01:02:38 UTC
and its conversion takes approx. 18 times longer:
%timeit pd.to_datetime(df.created_at)
1000 loops, best of 3: 1.74 ms per loop
What can I do to speed this up?
Edit:
Some results from the suggestions (df.created_at
is the slow
format):
%timeit pd.to_datetime(df.acquisition_date)
%timeit pd.to_datetime(df.created_at)
%timeit pd.to_datetime(df.created_at, infer_datetime_format=True)
%timeit pd.to_datetime(df.created_at, format='%Y-%m-%d %H:%M:%S %Z')
10000 loops, best of 3: 98.5 µs per loop
1000 loops, best of 3: 1.73 ms per loop
1000 loops, best of 3: 955 µs per loop
1000 loops, best of 3: 222 µs per loop
Upvotes: 1
Views: 158
Reputation: 13259
Make it explicit when you can.
> dates = pd.Series(['2013-01-07 01:02:38 UTC']*100000)
> %timeit pd.to_datetime(dates)
^C
> %timeit pd.to_datetime(dates, format='%Y-%m-%d %H:%M:%S %Z')
1 loops, best of 3: 570 ms per loop
Oddly, this seems to hurt the other kind.
> dates = pd.Series(['2009-03-09']*100000)
> %timeit pd.to_datetime(dates)
10 loops, best of 3: 22.2 ms per loop
> %timeit pd.to_datetime(dates, format='%Y-%m-%d')
1 loops, best of 3: 342 ms per loop
Note that the preferred form for your long-form timestamps gets us back up to speed:
> dates = pd.Series(['2013-01-07T01:02:38Z']*100000)
> %timeit pd.to_datetime(dates)
10 loops, best of 3: 23.1 ms per loop
Upvotes: 2