Reputation: 2785
I have a dataframe with two columns, each containing dates. I want to compute the difference between the dates and return the number of days. However, the process (shown below) is very slow. Does anyone know how to speed it up? This code runs on a large file, so speed is important.
from datetime import datetime
import pandas as pd
dfx = pd.DataFrame([[datetime(2014,1,1), datetime(2014,1,10)],[datetime(2014,1,1), datetime(2015,1,10)],[datetime(2013,1,1), datetime(2014,1,12)]], columns=['x', 'y'])
dfx['diffx'] = dfx['y'] - dfx['x']
dfx['diff'] = dfx['diffx'].apply(lambda x: x.days)
dfx
Final goal:
Upvotes: 2
Views: 1068
Reputation: 164693
You may find a massive speed-up by dropping down to NumPy, bypassing the overhead associated with pd.Series objects.
See also: pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?
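# Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3
import numpy as np

def days_lambda(dfx):
    # row-by-row Python call on each Timedelta
    return (dfx['y']-dfx['x']).apply(lambda x: x.days)

def days_pd(dfx):
    # pandas datetime accessor
    return (dfx['y']-dfx['x']).dt.days

def days_np(dfx):
    # raw NumPy arrays; dividing by one day yields float days
    return (dfx['y'].values-dfx['x'].values) / np.timedelta64(1, 'D')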
# check results are identical
assert (days_lambda(dfx).values == days_pd(dfx).values).all()
assert (days_lambda(dfx).values == days_np(dfx)).all()
dfx = pd.concat([dfx]*100000)
%timeit days_lambda(dfx) # 5.02 s per loop
%timeit days_pd(dfx) # 5.6 s per loop
%timeit days_np(dfx) # 4.72 ms per loop
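One caveat with the NumPy version: dividing by np.timedelta64(1, 'D') returns floating-point days, so it only matches .dt.days exactly when every difference is a whole number of days, as in the sample data. The sketch below uses made-up timestamps with a time-of-day component (not from the question) and assumes a NumPy version recent enough to support floor division between timedelta64 values. It shows how true division and floor division differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a time-of-day component (not from the original post)
df = pd.DataFrame({
    "x": pd.to_datetime(["2014-01-01 00:00", "2014-01-01 12:00"]),
    "y": pd.to_datetime(["2014-01-10 00:00", "2014-01-10 18:00"]),
})

delta = df["y"].values - df["x"].values        # NumPy timedelta64 array
days_float = delta / np.timedelta64(1, "D")    # true division: fractional days
days_int = delta // np.timedelta64(1, "D")     # floor division: whole days only

# Reference result from the pandas accessor
pandas_days = (df["y"] - df["x"]).dt.days.values

print(days_float)   # second entry is fractional (9 days 6 hours)
print(days_int)
print(pandas_days)
```

If the columns hold pure dates, either form should give the same numbers; the floor-division form simply keeps an integer dtype.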
Upvotes: 2