Reputation: 341
I'm having an issue working out a rolling count of transactions for each individual buyer in this dataset, which is structured as follows:
userID itemID transaction_ts
3229 4493320 2016-01-02 14:55:00
3229 4492492 2016-01-02 14:57:02
3229 4496756 2016-01-04 09:01:18
3229 4493673 2016-01-04 09:11:10
3229 4497531 2016-01-04 11:05:25
3229 4495006 2016-01-05 07:25:11
4330 4500695 2016-01-02 09:17:21
4330 4500656 2016-01-03 09:19:28
4330 4503087 2016-01-04 07:42:15
4330 4501846 2016-01-04 08:55:24
4330 4504105 2016-01-04 09:59:35
Ideally, it would look like the table below for a rolling transaction-count window of, e.g., 24 hours:
userID itemID transaction_ts rolling_count
3229 4493320 2016-01-02 14:55:00 1
3229 4492492 2016-01-02 14:57:02 2
3229 4496756 2016-01-04 09:01:18 1
3229 4493673 2016-01-04 09:11:10 2
3229 4497531 2016-01-04 11:05:25 3
3229 4495006 2016-01-05 07:25:11 4
4330 4500695 2016-01-02 09:17:21 1
4330 4500656 2016-01-03 09:19:28 1
4330 4503087 2016-01-04 07:42:15 2
4330 4501846 2016-01-04 08:55:24 3
4330 4504105 2016-01-04 09:59:35 3
There is an excellent answer to a similar problem here: pandas rolling sum of last five minutes
However, that answer depends solely on the timestamp field, whereas here the rolling count must reset to 1 whenever a row belongs to a different user than the row above it. It is possible to find a solution via slicing, but given the size of this dataset (potentially 1m+ rows) that is not feasible.
Crucially, the window should reflect the 24-hour period prior to the transaction_ts of the respective row, which is why I think a custom df.apply or rolling-window method is appropriate; I just can't figure out how to make it conditional on the userID.
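For reference, the slicing approach I have in mind is roughly the sketch below (slow_rolling_count is just an illustrative name). It reproduces the expected output on the sample data, but it scans the whole frame once per row, which is why I don't expect it to scale to 1m+ rows.

import pandas as pd

# the sample data from above
df = pd.DataFrame({
    'userID': [3229] * 6 + [4330] * 5,
    'itemID': [4493320, 4492492, 4496756, 4493673, 4497531, 4495006,
               4500695, 4500656, 4503087, 4501846, 4504105],
    'transaction_ts': pd.to_datetime([
        '2016-01-02 14:55:00', '2016-01-02 14:57:02', '2016-01-04 09:01:18',
        '2016-01-04 09:11:10', '2016-01-04 11:05:25', '2016-01-05 07:25:11',
        '2016-01-02 09:17:21', '2016-01-03 09:19:28', '2016-01-04 07:42:15',
        '2016-01-04 08:55:24', '2016-01-04 09:59:35']),
})

def slow_rolling_count(row, hours=24):
    # count this user's transactions in the `hours` up to and including this row
    window_start = row['transaction_ts'] - pd.Timedelta(hours=hours)
    mask = ((df['userID'] == row['userID'])
            & (df['transaction_ts'] >= window_start)
            & (df['transaction_ts'] <= row['transaction_ts']))
    return mask.sum()

df['rolling_count'] = df.apply(slow_rolling_count, axis=1)  # correct, but one full scan per row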
Upvotes: 7
Views: 3596
Reputation: 341
I managed to get a solution, which works on the test set at least. ptrj got there first, though! The first solution to this question, Pandas Rolling Computations on Sliding Windows (Unevenly spaced), helped out a lot.
As ptrj pointed out earlier, using df.groupby('userID') is the key.
import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.read_excel('velocity.xlsx')  # read the dataframe in
df = df.sort_values(['userID', 'transaction_ts'])
df = df.reset_index(drop=True)  # re-number rows 0..n-1 in userID / transaction_ts order
df['ones'] = 1  # helper column so a cumsum gives a running row count

def add_rolling_count(x, number_of_hours):
    # start of each row's rolling window
    x['lag'] = x['transaction_ts'] - timedelta(hours=number_of_hours)
    # position of the first transaction inside each row's window
    inds = np.searchsorted(np.array(x['transaction_ts'].astype(np.int64)),
                           np.array(x['lag'].astype(np.int64)))
    cs = x['ones'].reset_index(drop=True).cumsum()
    x['count'] = cs.values - cs[inds].values + 1
    return x

df = df.groupby('userID').apply(lambda x: add_rolling_count(x, 24))
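The np.searchsorted line is doing the heavy lifting: for each row it finds the position of the first transaction that is still inside that row's window, so the rolling count is just a difference of cumulative positions. A minimal illustration using three of user 3229's timestamps from the question (the variable names here are only for the example):

import numpy as np
import pandas as pd

ts = pd.to_datetime(['2016-01-04 09:01:18', '2016-01-04 09:11:10',
                     '2016-01-05 07:25:11'])
lag = ts - pd.Timedelta(hours=24)  # start of each row's 24-hour window

# for each window start, the position of the first timestamp >= that start
inds = np.searchsorted(ts.astype(np.int64), lag.astype(np.int64))
print(inds)                            # [0 0 0] -> every earlier row falls inside each window here
print(np.arange(len(ts)) - inds + 1)   # [1 2 3], the rolling counts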
Upvotes: 0
Reputation: 5212
Part of the solution (a rolling cumsum) may already be here (I only changed the type of lag):
import numpy as np
import pandas as pd
from datetime import timedelta

def msum(s, lag):
    lag = s.index - timedelta(days=lag)  # start of each row's rolling window
    # position of the first timestamp inside each window
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    return pd.Series(cs.values - cs.values[inds] + s.values[inds], index=s.index)
The function requires the index to be of datetime type. Moreover, the index within each userID group should already be sorted (as it is in your example).
df = df.set_index('transaction_ts')  # msum needs a datetime index
df['rolling_count'] = 1              # column of ones; a windowed sum of it is the transaction count
df['rolling_count'] = (df.groupby('userID', sort=False)['rolling_count']
                         .transform(lambda x: msum(x, 1)))  # lag of 1 day = 24-hour window
The groupby option sort=False may give some speedup (it controls whether the group keys are sorted).
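To make this concrete, here is a small end-to-end sketch on user 4330's rows from your example. It assumes msum is defined as above (with its imports); the printed counts should match the expected output in the question.

import pandas as pd

# sample rows for user 4330 from the question; msum as defined above
df = pd.DataFrame({
    'userID': [4330] * 5,
    'transaction_ts': pd.to_datetime([
        '2016-01-02 09:17:21', '2016-01-03 09:19:28', '2016-01-04 07:42:15',
        '2016-01-04 08:55:24', '2016-01-04 09:59:35']),
})

df = df.sort_values(['userID', 'transaction_ts']).set_index('transaction_ts')
df['rolling_count'] = 1
df['rolling_count'] = (df.groupby('userID', sort=False)['rolling_count']
                         .transform(lambda x: msum(x, 1)))
print(df['rolling_count'].tolist())  # [1, 1, 2, 3, 3] -- matches the table in the question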
Upvotes: 4