Conversion of a timedelta to int very slow in python

Question

I have a dataframe with two columns, each one formed by a set of dates. I want to compute the difference between dates and return the the number of days. However, the process (described above) is very slow. Does anyone knows how to accelerate the process? This code is being used in a big file and speed is important.

dfx = pd.DataFrame([[datetime(2014,1,1), datetime(2014,1,10)],[datetime(2014,1,1), datetime(2015,1,10)],[datetime(2013,1,1),  datetime(2014,1,12)]], columns = ['x', 'y'])

dfx['diffx'] = dfx['y']-dfx['x']
dfx['diff'] = dfx['diffx'].apply(lambda x: x.days)
dfx

Final goal:

jpp · Accepted Answer

You may find a ~~marginal~~ massive speed-up dropping down to NumPy, bypassing the overhead associated with pd.Series objects.

See also pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?.

# Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3

def days_lambda(dfx):
    return (dfx['y']-dfx['x']).apply(lambda x: x.days)

def days_pd(dfx):
    return (dfx['y']-dfx['x']).dt.days

def days_np(dfx):
    return (dfx['y'].values-dfx['x'].values) / np.timedelta64(1, 'D')

# check results are identical
assert (days_lambda(dfx).values == days_pd(dfx).values).all()
assert (days_lambda(dfx).values == days_np(dfx)).all()

dfx = pd.concat([dfx]*100000)

%timeit days_lambda(dfx)  # 5.02 s per loop
%timeit days_pd(dfx)      # 5.6 s per loop
%timeit days_np(dfx)      # 4.72 ms per loop

Conversion of a timedelta to int very slow in python

Answers (1)

Related Questions