randuser2843
randuser2843

Reputation: 23

Count occurrences in custom time frames

I am currently working with session logs and are interested in counting the number of occurrences of specific events in a custom time frame (first 1 [5, 10, after 10]) minute. To simplify: the start of a session is defined as the time of the first occurrence of a relevant event. I have already filtered the sessions by only the relevant events and the dataframe looks similar to this.

Input

import pandas as pd
data_in = {'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
        'Timestamp': ['2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00', '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00', '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00', '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00', '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00'],
        'event': ['match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match']
}
df_in = pd.DataFrame(data_in)
df_in

Desired Output:

data_out = {'SessionId': ['A', 'B', 'C'],
        '#events_first_1_minute': [2, 1, 1],
        '#events_first_5_minute': [4, 2, 1],
        '#events_first_10_minute': [4, 2, 2],
        '#events_after_10_minute': [5, 3, 2]
}
df_out = pd.DataFrame(data_out)
df_out

I already played around with groupby and pd.Grouper. I get the number of relevant events per session in total, but I don´t see any option for custom time bins. Another idea was also to get rid of the date part and focus only on the time, but there are of course also sessions that started on a day and ended on the other (SessionId: C).
Any help is appreciated!

Upvotes: 2

Views: 398

Answers (3)

mozway
mozway

Reputation: 260335

Using pandas.cut:

df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])

bins = ['1min', '5min', '10min']
bins2 = pd.to_timedelta(['0']+bins+['10000days'])
group = pd.cut(df_in.groupby('SessionId')['Timestamp'].apply(lambda x: x-x.min()),
               bins=bins2, labels=bins+['>'+bins[-1]]).fillna(bins[0])


(df_in
 .groupby(['SessionId', group]).size()
 .unstack(level=1)
 .cumsum(axis=1)
)

output:

Timestamp  1min  5min  10min  >10min
SessionId                           
A             2     4      4       5
B             1     2      2       3
C             1     1      2       2

Upvotes: 3

jezrael
jezrael

Reputation: 862481

Use:

#convert values to datetimes
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])

#get minutes by substract minimal datetime per group
df_in['g']=(df_in['Timestamp'].sub(df_in.groupby('SessionId')['Timestamp'].transform('min'))
                              .dt.total_seconds().div(60)

#binning to intervals
lab = ['events_first_1_minute','events_first_5_minute','events_first_10_minute',
       'events_after_10_minute']
df_in['g'] = pd.cut(df_in['g'], 
                     bins=[0, 1, 5, 10, np.inf], 
                    labels=lab, include_lowest=True)

#count values with cumulative sums
df = (pd.crosstab(df_in['SessionId'], df_in['g'])
        .cumsum(axis=1)
        .rename(columns=str)
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
  SessionId  events_first_1_minute  events_first_5_minute  \
0         A                      2                      4   
1         B                      1                      2   
2         C                      1                      1   

   events_first_10_minute  events_after_10_minute  
0                       4                       5  
1                       2                       3  
2                       2                       2  

Upvotes: 1

Corralien
Corralien

Reputation: 120391

First convert your Timestamp column to datetime64 dtype then group by SessionId and aggregate data:

# Not mandatory if it's already the case
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

out = df.groupby('SessionId')['Timestamp'] \
        .agg(**{'<1m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=1)),
                '<5m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=5)),
                '<10m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=10)),
                '>10m': 'size'}).reset_index()

Output:

>>> out
  SessionId  <1m  <5m  <10m  >10m
0         A    2    4     4     5
1         B    1    2     2     3
2         C    1    1     2     2

Upvotes: 2

Related Questions