Is there a way to vectorize this pandas apply method to make the code run faster?

Question

Currently it is taking 40 - 50 mins to run ~220K rows

shop    timestamp   flag
10061   1577525275  NaN
10061   1577534732  NaN
10061   1577741715  NaN
10061   1577741800  NaN
10084   1577405286  NaN

def foo(row):
    criteria = (pd.isnull(df2.flag)) & (df.shop==row.shop) & (abs(df.timestamp-row.timestamp) <= 3600)
    df2.loc[criteria, 'flag'] = 1

df2 = df.copy()
df2.apply(foo, axis=1)

What I'm doing is for each row in the df2, I'm check against the main df to see if in the same shop, what other rows have timestamp within the hour and set the flag as 1. Is there a way to use numpy vectors to make this run much faster?

Expected output:

shop    timestamp   flag
10061   1577525275  NaN
10061   1577534732  NaN
10061   1577741715  1
10061   1577741800  1
10084   1577405286  NaN

Valdi_Bo · Accepted Answer

To do your task a quicker way, define the following function:

def newFlag(grp):
    tt = grp.timestamp
    ind = np.nonzero(np.triu(np.absolute(tt[np.newaxis, :] - tt[:, np.newaxis]) <= 3600, 1))
    tbl = grp.flag.values
    tbl[np.concatenate(ind)] = 1
    return pd.Series(np.where(np.isnan(grp.flag), tbl, grp.flag), index=grp.index)

Then apply it, saving the result in flag column:

df['flag'] = df.groupby('shop').apply(newFlag).reset_index(level=0, drop=True)

The speed of this solution is based on grouping by shop, so you don't have to compare rows concerning different shops.

Another important factor concerning the speed is employment of Numpy functions, which operate much quicker than Pandas.

To fully comprehend all details, run this code step-by-step for a chosen group of rows (for particular shop) and look at the results.

Is there a way to vectorize this pandas apply method to make the code run faster?

Answers (2)

Related Questions