K.-Michael Aye
K.-Michael Aye

Reputation: 5605

speed difference for pd.to_datetime for 2 different formats

I have 2 different date columns in my dataframe that I want to convert to datetime64.

One has the format

0    2009-03-09

and is converted rather fast:

%timeit pd.to_datetime(df.acquisition_date)
10000 loops, best of 3: 97.9 µs per loop

The other one has this format:

0    2013-01-07 01:02:38 UTC

and its conversion takes approx. 18 times longer:

%timeit pd.to_datetime(df.created_at)
1000 loops, best of 3: 1.74 ms per loop

What can I do to speed this up?

Edit: Some results from the suggestions (df.created_at is the slow format):

%timeit pd.to_datetime(df.acquisition_date)
%timeit pd.to_datetime(df.created_at)
%timeit pd.to_datetime(df.created_at, infer_datetime_format=True)
%timeit pd.to_datetime(df.created_at, format='%Y-%m-%d %H:%M:%S %Z')
10000 loops, best of 3: 98.5 µs per loop
1000 loops, best of 3: 1.73 ms per loop
1000 loops, best of 3: 955 µs per loop
1000 loops, best of 3: 222 µs per loop

Upvotes: 1

Views: 158

Answers (1)

U2EF1
U2EF1

Reputation: 13259

Make it explicit when you can.

 > dates = pd.Series(['2013-01-07 01:02:38 UTC']*100000)
 > %timeit pd.to_datetime(dates)
 ^C
 > %timeit pd.to_datetime(dates, format='%Y-%m-%d %H:%M:%S %Z')
 1 loops, best of 3: 570 ms per loop

Oddly, this seems to hurt the other kind.

> dates = pd.Series(['2009-03-09']*100000)
> %timeit pd.to_datetime(dates)
10 loops, best of 3: 22.2 ms per loop
> %timeit pd.to_datetime(dates, format='%Y-%m-%d')
1 loops, best of 3: 342 ms per loop

Note that the preferred form for your long-form timestamps gets us back up to speed:

> dates = pd.Series(['2013-01-07T01:02:38Z']*100000)
> %timeit pd.to_datetime(dates)
10 loops, best of 3: 23.1 ms per loop

Upvotes: 2

Related Questions