Reputation: 23
I am currently working with session logs and am interested in counting the number of occurrences of specific events within custom time frames (first 1, first 5, first 10, and after 10 minutes). To simplify: the start of a session is defined as the time of the first occurrence of a relevant event. I have already filtered the sessions down to only the relevant events, and the dataframe looks similar to this.
Input
import pandas as pd
data_in = {
    'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Timestamp': ['2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00',
                  '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00',
                  '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00',
                  '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00',
                  '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00'],
    'event': ['match', 'match', 'match', 'match', 'match',
              'match', 'match', 'match', 'match', 'match'],
}
df_in = pd.DataFrame(data_in)
df_in
Desired Output:
data_out = {
    'SessionId': ['A', 'B', 'C'],
    '#events_first_1_minute': [2, 1, 1],
    '#events_first_5_minute': [4, 2, 1],
    '#events_first_10_minute': [4, 2, 2],
    '#events_after_10_minute': [5, 3, 2],
}
df_out = pd.DataFrame(data_out)
df_out
I have already played around with groupby and pd.Grouper. I can get the total number of relevant events per session, but I don't see any option for custom time bins. Another idea was to drop the date part and focus only on the time, but of course there are also sessions that start on one day and end on the next (SessionId: C).
Any help is appreciated!
Upvotes: 2
Views: 398
Reputation: 260335
Using pandas.cut:
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])

bins = ['1min', '5min', '10min']
bins2 = pd.to_timedelta(['0'] + bins + ['10000days'])
group = pd.cut(df_in.groupby('SessionId')['Timestamp'].apply(lambda x: x - x.min()),
               bins=bins2, labels=bins + ['>' + bins[-1]]).fillna(bins[0])
(df_in
 .groupby(['SessionId', group]).size()
 .unstack(level=1)
 .cumsum(axis=1)
)
output:
Timestamp 1min 5min 10min >10min
SessionId
A 2 4 4 5
B 1 2 2 3
C 1 1 2 2
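Note on the `.fillna(bins[0])` step above: `pd.cut` produces intervals that are open on the left by default, so the session-start row (whose delta is exactly 0) falls outside the first bin `(0, 1min]` and comes back as NaN; filling with the first label puts it back into the first-minute bucket. A minimal sketch of that behavior:

```python
import pandas as pd

# three deltas relative to the session start
deltas = pd.to_timedelta(['0min', '0.5min', '3min'])
bins = pd.to_timedelta(['0min', '1min', '5min'])

# the 0-delta entry gets no label because 0 is excluded from (0min, 1min]
cut = pd.cut(deltas, bins=bins, labels=['1min', '5min'])
```

Filling the NaN with the first label (as the answer does) is equivalent to treating the interval as closed on the left for the very first event.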
Upvotes: 3
Reputation: 862481
Use:
import numpy as np

# convert values to datetimes
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])

# get minutes by subtracting the minimal datetime per group
df_in['g'] = (df_in['Timestamp'].sub(df_in.groupby('SessionId')['Timestamp'].transform('min'))
                                .dt.total_seconds().div(60))

# binning to intervals
lab = ['events_first_1_minute', 'events_first_5_minute', 'events_first_10_minute',
       'events_after_10_minute']
df_in['g'] = pd.cut(df_in['g'],
                    bins=[0, 1, 5, 10, np.inf],
                    labels=lab, include_lowest=True)

# count values with cumulative sums
df = (pd.crosstab(df_in['SessionId'], df_in['g'])
        .cumsum(axis=1)
        .rename(columns=str)
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
  SessionId  events_first_1_minute  events_first_5_minute  events_first_10_minute  events_after_10_minute
0         A                      2                      4                       4                       5
1         B                      1                      2                       2                       3
2         C                      1                      1                       2                       2
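The crosstab/cumsum step can be checked on a tiny example: `pd.crosstab` yields per-bin counts, and `cumsum(axis=1)` turns them into the cumulative "within the first N minutes" counts, since each bin's columns are ordered from earliest to latest.

```python
import pandas as pd

# one session with 1 event in the first minute and 2 more in the first 5
tiny = pd.DataFrame({'s': ['A', 'A', 'A'], 'bin': ['1min', '5min', '5min']})

counts = pd.crosstab(tiny['s'], tiny['bin'])  # per-bin counts: 1 and 2
cumulative = counts.cumsum(axis=1)            # cumulative counts: 1 and 3
```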
Upvotes: 1
Reputation: 120391
First convert your Timestamp column to datetime64 dtype, then group by SessionId and aggregate the data:
# Not mandatory if it's already the case
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
out = df.groupby('SessionId')['Timestamp'] \
.agg(**{'<1m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=1)),
'<5m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=5)),
'<10m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=10)),
'>10m': 'size'}).reset_index()
Output:
>>> out
SessionId <1m <5m <10m >10m
0 A 2 4 4 5
1 B 1 2 2 3
2 C 1 1 2 2
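Since the three threshold lambdas differ only in the minute offset, they can also be built in a loop. This is a variant of my own (the loop-built `aggs` dict is an assumption, not part of the answer above), sketched on a tiny two-session frame; it relies on the timestamps being sorted within each session, so that `x.iloc[0]` is the session start:

```python
import pandas as pd

df = pd.DataFrame({
    'SessionId': ['A', 'A', 'B'],
    'Timestamp': pd.to_datetime(['2020-01-01 00:00:00',
                                 '2020-01-01 00:03:00',
                                 '2020-01-01 00:00:30']),
})

# one lambda per cutoff; m=m pins the current offset into each closure
aggs = {f'<{m}m': (lambda x, m=m: (x < x.iloc[0] + pd.DateOffset(minutes=m)).sum())
        for m in (1, 5, 10)}
aggs['>10m'] = 'size'  # total events per session

out = df.groupby('SessionId')['Timestamp'].agg(**aggs).reset_index()
```

The `m=m` default argument is what keeps each lambda bound to its own cutoff; without it, all three closures would see the final loop value.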
Upvotes: 2