Reputation: 23

Count occurrences in custom time frames

I am currently working with session logs and are interested in counting the number of occurrences of specific events in a custom time frame (first 1 [5, 10, after 10]) minute. To simplify: the start of a session is defined as the time of the first occurrence of a relevant event. I have already filtered the sessions by only the relevant events and the dataframe looks similar to this.

Input

import pandas as pd
data_in = {'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
        'Timestamp': ['2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00', '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00', '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00', '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00', '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00'],
        'event': ['match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match']
}
df_in = pd.DataFrame(data_in)
df_in

Desired Output:

data_out = {'SessionId': ['A', 'B', 'C'],
        '#events_first_1_minute': [2, 1, 1],
        '#events_first_5_minute': [4, 2, 1],
        '#events_first_10_minute': [4, 2, 2],
        '#events_after_10_minute': [5, 3, 2]
}
df_out = pd.DataFrame(data_out)
df_out

I already played around with groupby and pd.Grouper. I get the number of relevant events per session in total, but I don´t see any option for custom time bins. Another idea was also to get rid of the date part and focus only on the time, but there are of course also sessions that started on a day and ended on the other (SessionId: C).
Any help is appreciated!

Upvotes: 2

Answers (3)

mozway

Reputation: 260335

Using pandas.cut:

df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])

bins = ['1min', '5min', '10min']
bins2 = pd.to_timedelta(['0']+bins+['10000days'])
group = pd.cut(df_in.groupby('SessionId')['Timestamp'].apply(lambda x: x-x.min()),
               bins=bins2, labels=bins+['>'+bins[-1]]).fillna(bins[0])


(df_in
 .groupby(['SessionId', group]).size()
 .unstack(level=1)
 .cumsum(axis=1)
)

output:

Timestamp  1min  5min  10min  >10min
SessionId                           
A             2     4      4       5
B             1     2      2       3
C             1     1      2       2

Upvotes: 3

jezrael

Reputation: 862481

Use:

#convert values to datetimes
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])

#get minutes by substract minimal datetime per group
df_in['g']=(df_in['Timestamp'].sub(df_in.groupby('SessionId')['Timestamp'].transform('min'))
                              .dt.total_seconds().div(60)

#binning to intervals
lab = ['events_first_1_minute','events_first_5_minute','events_first_10_minute',
       'events_after_10_minute']
df_in['g'] = pd.cut(df_in['g'], 
                     bins=[0, 1, 5, 10, np.inf], 
                    labels=lab, include_lowest=True)

#count values with cumulative sums
df = (pd.crosstab(df_in['SessionId'], df_in['g'])
        .cumsum(axis=1)
        .rename(columns=str)
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
  SessionId  events_first_1_minute  events_first_5_minute  \
0         A                      2                      4   
1         B                      1                      2   
2         C                      1                      1   

   events_first_10_minute  events_after_10_minute  
0                       4                       5  
1                       2                       3  
2                       2                       2

Upvotes: 1

Corralien

Reputation: 120391

First convert your Timestamp column to datetime64 dtype then group by SessionId and aggregate data:

# Not mandatory if it's already the case
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

out = df.groupby('SessionId')['Timestamp'] \
        .agg(**{'<1m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=1)),
                '<5m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=5)),
                '<10m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=10)),
                '>10m': 'size'}).reset_index()

Output:

>>> out
  SessionId  <1m  <5m  <10m  >10m
0         A    2    4     4     5
1         B    1    2     2     3
2         C    1    1     2     2

Upvotes: 2

Count occurrences in custom time frames

Answers (3)

Related Questions