wayneloo
wayneloo

Reputation: 251

Is there a way to vectorize this pandas apply method to make the code run faster?

Currently it is taking 40 - 50 mins to run ~220K rows

shop    timestamp   flag
10061   1577525275  NaN
10061   1577534732  NaN
10061   1577741715  NaN
10061   1577741800  NaN
10084   1577405286  NaN
def foo(row):
    criteria = (pd.isnull(df2.flag)) & (df.shop==row.shop) & (abs(df.timestamp-row.timestamp) <= 3600)
    df2.loc[criteria, 'flag'] = 1
df2 = df.copy()
df2.apply(foo, axis=1)

What I'm doing is for each row in the df2, I'm check against the main df to see if in the same shop, what other rows have timestamp within the hour and set the flag as 1. Is there a way to use numpy vectors to make this run much faster?

Expected output:

shop    timestamp   flag
10061   1577525275  NaN
10061   1577534732  NaN
10061   1577741715  1
10061   1577741800  1
10084   1577405286  NaN

Upvotes: 0

Views: 72

Answers (2)

Valdi_Bo
Valdi_Bo

Reputation: 31011

To do your task a quicker way, define the following function:

def newFlag(grp):
    tt = grp.timestamp
    ind = np.nonzero(np.triu(np.absolute(tt[np.newaxis, :] - tt[:, np.newaxis]) <= 3600, 1))
    tbl = grp.flag.values
    tbl[np.concatenate(ind)] = 1
    return pd.Series(np.where(np.isnan(grp.flag), tbl, grp.flag), index=grp.index)

Then apply it, saving the result in flag column:

df['flag'] = df.groupby('shop').apply(newFlag).reset_index(level=0, drop=True)

The speed of this solution is based on grouping by shop, so you don't have to compare rows concerning different shops.

Another important factor concerning the speed is employment of Numpy functions, which operate much quicker than Pandas.

To fully comprehend all details, run this code step-by-step for a chosen group of rows (for particular shop) and look at the results.

Upvotes: 1

Ebrahim Badrestani
Ebrahim Badrestani

Reputation: 86

You need to group your dataframe by shop, then sort each group by timestamp, and finally only check the previous and next rows:

print(df)
    shop   timestamp
0  10061  1577525275
1  10061  1577534732
2  10061  1577741715
3  10061  1577741800
4  10084  1577405286


def have_similar(df):
    df = df.sort_values('timestamp')
    df = df.assign(
        flag = (df.timestamp - df.timestamp.shift(1) < 3600) |
               (df.timestamp.shift(-1) - df.timestamp < 3600)
    )
    return df

result = df.groupby('shop').apply(have_similar).reset_index(drop=True)


print(result)
    shop   timestamp   flag
0  10061  1577525275  False
1  10061  1577534732  False
2  10061  1577741715   True
3  10061  1577741800   True
4  10084  1577405286  False

Upvotes: 0

Related Questions