Reputation: 251
Currently it is taking 40 - 50 mins to run ~220K rows
shop timestamp flag
10061 1577525275 NaN
10061 1577534732 NaN
10061 1577741715 NaN
10061 1577741800 NaN
10084 1577405286 NaN
def foo(row):
criteria = (pd.isnull(df2.flag)) & (df.shop==row.shop) & (abs(df.timestamp-row.timestamp) <= 3600)
df2.loc[criteria, 'flag'] = 1
df2 = df.copy()
df2.apply(foo, axis=1)
What I'm doing is for each row in the df2, I'm check against the main df to see if in the same shop, what other rows have timestamp within the hour and set the flag as 1. Is there a way to use numpy vectors to make this run much faster?
Expected output:
shop timestamp flag
10061 1577525275 NaN
10061 1577534732 NaN
10061 1577741715 1
10061 1577741800 1
10084 1577405286 NaN
Upvotes: 0
Views: 72
Reputation: 31011
To do your task a quicker way, define the following function:
def newFlag(grp):
tt = grp.timestamp
ind = np.nonzero(np.triu(np.absolute(tt[np.newaxis, :] - tt[:, np.newaxis]) <= 3600, 1))
tbl = grp.flag.values
tbl[np.concatenate(ind)] = 1
return pd.Series(np.where(np.isnan(grp.flag), tbl, grp.flag), index=grp.index)
Then apply it, saving the result in flag column:
df['flag'] = df.groupby('shop').apply(newFlag).reset_index(level=0, drop=True)
The speed of this solution is based on grouping by shop, so you don't have to compare rows concerning different shops.
Another important factor concerning the speed is employment of Numpy functions, which operate much quicker than Pandas.
To fully comprehend all details, run this code step-by-step for a chosen group of rows (for particular shop) and look at the results.
Upvotes: 1
Reputation: 86
You need to group your dataframe by shop, then sort each group by timestamp, and finally only check the previous and next rows:
print(df)
shop timestamp
0 10061 1577525275
1 10061 1577534732
2 10061 1577741715
3 10061 1577741800
4 10084 1577405286
def have_similar(df):
df = df.sort_values('timestamp')
df = df.assign(
flag = (df.timestamp - df.timestamp.shift(1) < 3600) |
(df.timestamp.shift(-1) - df.timestamp < 3600)
)
return df
result = df.groupby('shop').apply(have_similar).reset_index(drop=True)
print(result)
shop timestamp flag
0 10061 1577525275 False
1 10061 1577534732 False
2 10061 1577741715 True
3 10061 1577741800 True
4 10084 1577405286 False
Upvotes: 0