Reputation: 8950
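I have a pandas dataframe with ca. 250,000 rows x 6 columns. One of the columns is a date, formatted as text. I need to do 3 things: convert the text to a date; create a date with the same year and month as the converted date, but with the day always set to the 15th; and calculate the date one month after that.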
I do all of this with apply statements. They work, but seem slow to me: 7 seconds in total, whereas any SQL would take a fraction of a second, even without parallelisation, on the same machine. If this were a one-off I wouldn't invest time in speeding this up, but I must do it multiple times on multiple dataframes of similar size.
Is there any way I can speed up my code? Thanks a lot!
import datetime
import dateutil.relativedelta

# this takes 3.1 seconds
df['date_reformatted'] = df['date_raw'].apply(lambda r: datetime.datetime.strptime(r, "%d/%m/%Y"))
# this takes 0.8 seconds
df['date_15'] = df['date_reformatted'].apply(lambda r: datetime.date(r.year, r.month, 15))
# this takes 3.3 seconds
df['date_next_month'] = df['date_15'].apply(lambda x: x + dateutil.relativedelta.relativedelta(months=1))
Upvotes: 1
Views: 604
Reputation: 128948
In [51]: df = pd.DataFrame({'date_raw': pd.to_datetime(['2000-12-31']*250000)})
In [66]: %timeit pd.DataFrame({'date_raw': pd.to_datetime(['2000-12-31']*250000)})
10 loops, best of 3: 47.4 ms per loop
In [52]: df
Out[52]:
date_raw
0 2000-12-31
1 2000-12-31
2 2000-12-31
3 2000-12-31
4 2000-12-31
5 2000-12-31
... ...
249994 2000-12-31
249995 2000-12-31
249996 2000-12-31
249997 2000-12-31
249998 2000-12-31
249999 2000-12-31
[250000 rows x 1 columns]
In [53]: df['date'] = pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('D') + pd.Timedelta('14d')
In [54]: df
Out[54]:
date_raw date
0 2000-12-31 2000-12-15
1 2000-12-31 2000-12-15
2 2000-12-31 2000-12-15
3 2000-12-31 2000-12-15
4 2000-12-31 2000-12-15
5 2000-12-31 2000-12-15
... ... ...
249994 2000-12-31 2000-12-15
249995 2000-12-31 2000-12-15
249996 2000-12-31 2000-12-15
249997 2000-12-31 2000-12-15
249998 2000-12-31 2000-12-15
249999 2000-12-31 2000-12-15
[250000 rows x 2 columns]
Timings
In [55]: %timeit pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('D') + pd.Timedelta('14d')
10 loops, best of 3: 62.1 ms per loop
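To see that this is not specific to the 31st, here is a minimal sketch on a few made-up dates. It uses the `Series.dt` accessor instead of wrapping the column in a `DatetimeIndex`; the sample values and variable names are only for illustration.

```python
import pandas as pd

# Hypothetical sample data: three dates in different months.
dates = pd.Series(pd.to_datetime(["2000-12-31", "2001-02-03", "2001-07-15"]))

# to_period('M') drops the day; to_timestamp() returns the start of each month.
month_start = dates.dt.to_period("M").dt.to_timestamp()

# The 1st of the month plus 14 days is always the 15th.
date_15 = month_start + pd.Timedelta(days=14)

print(date_15.dt.strftime("%Y-%m-%d").tolist())
# ['2000-12-15', '2001-02-15', '2001-07-15']
```

Whatever the original day of the month, the result lands on the 15th of that same month.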
This will be a little more compact after this PR is merged. In other words:
pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('15D')
Your question 3 is easy if you again convert to periods: adding 1 to a period advances it by one unit of its frequency, a month in this case. This is also vectorized.
In [80]: df['date_plus_1'] = (pd.DatetimeIndex(df.date).to_period('M') + 1).to_timestamp('D') + pd.Timedelta('14d')
In [81]: df
Out[81]:
date_raw date date_plus_1
0 2000-12-31 2000-12-15 2001-01-15
1 2000-12-31 2000-12-15 2001-01-15
2 2000-12-31 2000-12-15 2001-01-15
3 2000-12-31 2000-12-15 2001-01-15
4 2000-12-31 2000-12-15 2001-01-15
5 2000-12-31 2000-12-15 2001-01-15
... ... ... ...
249994 2000-12-31 2000-12-15 2001-01-15
249995 2000-12-31 2000-12-15 2001-01-15
249996 2000-12-31 2000-12-15 2001-01-15
249997 2000-12-31 2000-12-15 2001-01-15
249998 2000-12-31 2000-12-15 2001-01-15
249999 2000-12-31 2000-12-15 2001-01-15
[250000 rows x 3 columns]
In [82]: %timeit (pd.DatetimeIndex(df.date).to_period('M') + 1).to_timestamp('D') + pd.Timedelta('14d')
10 loops, best of 3: 56.7 ms per loop
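Putting the pieces together for the original question, which starts from text in dd/mm/yyyy form, something like the following should work. This is only a sketch: the helper name `add_mid_month_columns` and its parameters are hypothetical, and I have not timed it on 250,000 rows.

```python
import pandas as pd


def add_mid_month_columns(df, raw_col="date_raw", fmt="%d/%m/%Y"):
    """Hypothetical helper: parse text dates, then add the 15th of the
    same month and the 15th of the following month as new columns."""
    out = df.copy()
    # Vectorized parsing with an explicit format (no per-row strptime).
    parsed = pd.to_datetime(out[raw_col], format=fmt)
    months = parsed.dt.to_period("M")
    mid = pd.Timedelta(days=14)  # 1st of the month + 14 days = the 15th
    out["date_15"] = months.dt.to_timestamp() + mid
    # Adding 1 to a monthly period moves to the next month,
    # rolling over from December to January of the next year.
    out["date_next_month"] = (months + 1).dt.to_timestamp() + mid
    return out


sample = pd.DataFrame({"date_raw": ["31/12/2000", "03/02/2001"]})
result = add_mid_month_columns(sample)
print(result)
```

For the two sample rows, `date_15` is 2000-12-15 and 2001-02-15, and `date_next_month` is 2001-01-15 and 2001-03-15.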
Upvotes: 2
Reputation: 109546
Try just using integers and strings. Only convert to datetime objects if you really need them.
%%timeit -n10 df = pd.DataFrame({'date_raw': ['31/12/2000']*250000})
_, months, years = zip(*df.date_raw.str.split('/'))
months_years = [(1 if m == '12' else int(m) + 1,
                 int(y) + 1 if m == '12' else int(y))
                for m, y in zip(months, years)]
# New dates in dd-mm-yyyy format:
df['new_date'] = ['15-{0}-{1}'.format(x[0], x[1]) for x in months_years]
10 loops, best of 3: 583 ms per loop
>>> df.tail()
date_raw new_date
249995 31/12/2000 15-1-2001
249996 31/12/2000 15-1-2001
249997 31/12/2000 15-1-2001
249998 31/12/2000 15-1-2001
249999 31/12/2000 15-1-2001
The new dates are in text form (which is why it was fast). Creating datetime objects is a bit time consuming, but if you really need them:
%%timeit
df['new_date'].apply(lambda r: datetime.datetime.strptime(r, "%d-%m-%Y") )
1 loops, best of 3: 2.72 s per loop
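If you do need datetime objects, `pd.to_datetime` with an explicit `format` is a vectorized alternative to the `apply`/`strptime` call above. A minimal sketch on made-up strings in the same dd-mm-yyyy form (I have not timed this, so treat any speed-up as an expectation rather than a measurement):

```python
import pandas as pd

# Hypothetical sample in the same unpadded dd-mm-yyyy text form as new_date.
new_date = pd.Series(["15-1-2001", "15-12-2000"])

# Parse the whole column in one call instead of row by row.
parsed = pd.to_datetime(new_date, format="%d-%m-%Y")
print(parsed)
```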
Upvotes: 2
Reputation: 8906
Yes, you can do
df['date_formatted'] = pd.to_datetime(df['date_raw'], format= "%d/%m/%Y")
The second part is a bit unusual and I can't see how to vectorize it, but you could build both columns in a single loop, which might be a bit faster:
pd.DataFrame([(datetime.date(d.year, d.month, 15),
               # roll December over into January of the next year
               datetime.date(d.year + d.month // 12, d.month % 12 + 1, 15))
              for d in df.date_formatted],
             columns=['date_15', 'date_next_month'])
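For what it's worth, one possible way to avoid the Python loop altogether is plain datetime arithmetic: subtract the offset from the 15th, then add a one-month `pd.DateOffset`. This is a sketch on made-up data, not something I have benchmarked against the loop.

```python
import pandas as pd

# Hypothetical sample in the question's dd/mm/yyyy text form.
df = pd.DataFrame({"date_raw": ["31/12/2000", "03/02/2001"]})
df["date_formatted"] = pd.to_datetime(df["date_raw"], format="%d/%m/%Y")

# Distance in days from each date to the 15th of its month
# (negative when the date falls before the 15th).
day_shift = pd.to_timedelta(df["date_formatted"].dt.day - 15, unit="D")
df["date_15"] = df["date_formatted"] - day_shift

# DateOffset(months=1) handles the December -> January rollover.
df["date_next_month"] = df["date_15"] + pd.DateOffset(months=1)
print(df)
```

Because the 15th exists in every month, adding one month never has to clip the day.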
Upvotes: 2