Reputation: 62634
I have the following pandas dataframe format read into a CSV under variable "mydataframe" (that is sorted by RegisterTime):
Name, RegistrationID, RegisterTime
Jordan, 1, 2017-08-01T00:00:05
Jordan, 2, 2017-08-01T00:00:08
Jordan, 3, 2017-08-01T00:00:10
Sarah, 4, 2017-08-01T00:00:15
Jordan, 42, 2017-08-01T00:00:16
Sarah, 54, 2017-08-01T00:00:20
Jordan, 53, 2017-08-01T00:00:30
Jordan, 55, 2017-08-01T00:00:32
The dtype of my 'RegisterTime' column is 'object', and a printed value looks like '2017-08-01T00:00:15'.
I want to 'group' similar actions performed by a user in one sitting, within 10 seconds of each other, into a single group. I want to assign a new column 'ActionGroup' that holds an incremental value based on this.
Sample output dataframe:
Name, RegistrationID, RegisterTime, ActionGroup
Jordan, 1, 2017-08-01T00:00:05, 1
Jordan, 2, 2017-08-01T00:00:08, 1
Jordan, 3, 2017-08-01T00:00:10, 1
Sarah, 4, 2017-08-01T00:00:15, 2
Jordan, 42, 2017-08-01T00:00:16, 1
Sarah, 54, 2017-08-01T00:00:20, 2
Jordan, 53, 2017-08-01T00:00:30, 3
Jordan, 55, 2017-08-01T00:00:32, 3
What is the best way to go about this (adding the ActionGroup column for the dataframe)?
Upvotes: 3
Views: 158
Reputation: 11602
Here is a slight rewrite of the answer by Chris, making use of a "global" counter.
from itertools import count
import pandas as pd

# RegisterTime must be datetime for .diff().dt.seconds to work
df['RegisterTime'] = pd.to_datetime(df['RegisterTime'])

close = 10  # maximum gap, in seconds, within one ActionGroup
ctr = count(1)
df['ActionGroup'] = df.groupby('Name').RegisterTime.transform(
    # Processing each Name:
    # True if the row starts an ActionGroup.
    lambda x: x.diff().dt.seconds.fillna(close + 1).gt(close)
    # New ActionGroup ID if the row starts an ActionGroup, else 0.
    .apply(lambda b: next(ctr) if b else 0)
    # Forward fill, replacing zeros with the
    # previous non-zero ActionGroup ID.
    .replace(to_replace=0, method='ffill')
)
In the end, this calls a pure Python function for each row, so one should not expect great performance. If speed is an issue, one could define a global counter using numba.jitclass that converts boolean arrays into labelled arrays.
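Before reaching for numba, the per-row Python calls can also be avoided with plain pandas: mark the rows that start a sitting, label those start rows with a cumulative counter, and forward-fill the label within each Name. A minimal sketch (the DataFrame construction and the 10-second threshold are taken from the question; the intermediate names `gap`, `starts`, and `ids` are my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jordan', 'Jordan', 'Jordan', 'Sarah',
             'Jordan', 'Sarah', 'Jordan', 'Jordan'],
    'RegistrationID': [1, 2, 3, 4, 42, 54, 53, 55],
    'RegisterTime': pd.to_datetime([
        '2017-08-01T00:00:05', '2017-08-01T00:00:08',
        '2017-08-01T00:00:10', '2017-08-01T00:00:15',
        '2017-08-01T00:00:16', '2017-08-01T00:00:20',
        '2017-08-01T00:00:30', '2017-08-01T00:00:32']),
})

# True where a row starts a new sitting: either the first row for a
# Name, or a gap of more than 10 seconds since that Name's previous row.
gap = df.groupby('Name')['RegisterTime'].diff().gt(pd.Timedelta(seconds=10))
starts = gap | ~df['Name'].duplicated()

# Number the start rows 1, 2, 3, ... and leave the rest as NaN,
# then forward-fill the ID within each Name.
ids = starts.cumsum().where(starts)
df['ActionGroup'] = ids.groupby(df['Name']).ffill().astype(int)
```

This stays entirely in vectorized pandas operations, and on the sample data it reproduces the expected `ActionGroup` column (1, 1, 1, 2, 1, 2, 3, 3).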
Upvotes: 0
Reputation: 29742
Not the clearest way, but it works:
import pandas as pd
df['RegisterTime'] = pd.to_datetime(df['RegisterTime'])
s = df.groupby('Name').apply(
    # per Name: flag gaps > 10s, cumulative-sum them into a per-Name
    # session counter, then append the Name to make the key unique
    lambda x: x['RegisterTime'].diff().dt.seconds.fillna(0)
              .gt(10).cumsum().astype(str) + x['Name']
)
df['ActionGroup'] = df.groupby(s.droplevel(0)).ngroup() + 1
print(df)
Output:
Name RegistrationID RegisterTime ActionGroup
0 Jordan 1 2017-08-01 00:00:05 1
1 Jordan 2 2017-08-01 00:00:08 1
2 Jordan 3 2017-08-01 00:00:10 1
3 Sarah 4 2017-08-01 00:00:15 2
4 Jordan 42 2017-08-01 00:00:16 1
5 Sarah 54 2017-08-01 00:00:20 2
6 Jordan 53 2017-08-01 00:00:30 3
7 Jordan 55 2017-08-01 00:00:32 3
Upvotes: 2