Olivier
Olivier

Reputation: 3

Grouping data based on time interval

I have to group a dataset with multiple participants. The participants work a specific time on a specific tablet. If rows are the same tablet, and the time difference between consecutive rows is no more than 10 minutes, the rows belong to one participant. I would like to create a new column ("Participant") that numbers the participants. I know some python but this goes over my head. Thanks a lot!

Dataframe:

ID, Time, Tablet
1, 9:12, a
2, 9:14, a
3, 9:17, a
4, 9:45, a
5, 9:49, a
6, 9:51, a
7, 9:13, b 
8, 9:15, b
...

Goal:

ID, Time, Tablet, Participant
1, 9:12, a, 1
2, 9:14, a, 1
3, 9:17, a, 1
4, 9:45, a, 2
5, 9:49, a, 2
6, 9:51, a, 2
7, 9:13, b, 3
8, 9:15, b, 3
...

Upvotes: 0

Views: 592

Answers (2)

XXavier
XXavier

Reputation: 1226

You can groupby first then do a cumsum to get the participant column the way you want. Please make sure the time column is in datetime format and also sort it before you do this.

df['time'] = pd.to_datetime(df['time']) 
df['time_diff']=df.groupby(['tablet'])['time'].diff().dt.seconds/60 
df['participant'] = np.where((df['time_diff'].isnull()) | (df['time_diff']>10), 1,0).cumsum() 

Upvotes: 1

Fred
Fred

Reputation: 502

I've done something similar before, I used a combination of a group_by statement and using the Pandas shift function.

df = df.sort_values(["Tablet", "Time"])
df["Time_Period"] = df.groupby("Tablet")["Time"].shift(-1)-df["Time"]
df["Time_Period"] = df["Time_Period"].dt.total_seconds()
df["New_Participant"] = df["Time_Period"] > 10*60 #10 Minutes
df["Participant_ID"] = df["New_Participant"].cumsum()

Basically I flag every time there is a gap of over 10 minutes between sessions, then do a rolling sum to give each participant a unique ID

Upvotes: 0

Related Questions