nunodsousa

Reputation: 2785

Conversion of a timedelta to int very slow in python

I have a dataframe with two columns, each containing dates. I want to compute the difference between the two dates in each row and return the number of days as an integer. However, the process (shown below) is very slow. Does anyone know how to speed it up? This code is run on a large file, so speed is important.

from datetime import datetime
import pandas as pd

dfx = pd.DataFrame([[datetime(2014,1,1), datetime(2014,1,10)],[datetime(2014,1,1), datetime(2015,1,10)],[datetime(2013,1,1),  datetime(2014,1,12)]], columns = ['x', 'y'])


dfx['diffx'] = dfx['y']-dfx['x']
dfx['diff'] = dfx['diffx'].apply(lambda x: x.days)
dfx

Final goal: a diff column holding the number of days between x and y.

Upvotes: 2

Views: 1068

Answers (1)

jpp

Reputation: 164693

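You may find a massive speed-up by dropping down to NumPy, which bypasses the overhead associated with pd.Series objects. In the timings below, the NumPy version runs in milliseconds while the two pandas versions take seconds.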

See also the related question "pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?"

# Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3
import numpy as np
import pandas as pd

def days_lambda(dfx):
    return (dfx['y']-dfx['x']).apply(lambda x: x.days)

def days_pd(dfx):
    return (dfx['y']-dfx['x']).dt.days

def days_np(dfx):
    # subtract the underlying NumPy arrays; dividing by one day yields a float array of days
    return (dfx['y'].values-dfx['x'].values) / np.timedelta64(1, 'D')

# check results are identical
assert (days_lambda(dfx).values == days_pd(dfx).values).all()
assert (days_lambda(dfx).values == days_np(dfx)).all()

# repeat the 3-row frame to get 300,000 rows for timing
dfx = pd.concat([dfx]*100000)

%timeit days_lambda(dfx)  # 5.02 s per loop
%timeit days_pd(dfx)      # 5.6 s per loop
%timeit days_np(dfx)      # 4.72 ms per loop
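Note that days_np returns floats, while the question asks for integer days. A minimal sketch of one way to get integers from the same NumPy approach follows. It rebuilds the question's sample frame so it runs on its own; the cast to np.int64 is an assumption about the desired output type, not part of the original answer. Casting truncates toward zero, so for negative differences with a partial day it may not match .dt.days.

```python
import numpy as np
import pandas as pd

# Sample data matching the question's three rows.
dfx = pd.DataFrame({
    'x': pd.to_datetime(['2014-01-01', '2014-01-01', '2013-01-01']),
    'y': pd.to_datetime(['2014-01-10', '2015-01-10', '2014-01-12']),
})

# Subtract the raw NumPy arrays to get a timedelta64 array.
delta = dfx['y'].values - dfx['x'].values

# Dividing by one day gives float days; cast to int64 for whole days
# (the cast truncates toward zero).
dfx['diff'] = (delta / np.timedelta64(1, 'D')).astype(np.int64)

print(dfx)
```

The division step is the same as in days_np above, so the extra cast should add little to the runtime, though that has not been timed here.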

Upvotes: 2
