Reputation: 33
My data has a few hundred thousand rows with the following columns: Time (yyyy-MM-dd hh:mm:ss.ffffff), ID (string), Group1 (int32), Group2 (int32). I would like to count how many events came from the same Group1 and Group2 prior to each event, within a time window of 5 minutes.
For example:
ID Time Group1 Group2
61ED2269CCAC 2020-07-27 00:01:05.781 1234 100123
61C2DC4E96FA 2020-07-27 00:01:17.279 1234 100123
FAD0839C1A95 2020-07-27 00:02:38.112 1234 100124
A2750A7B6C24 2020-07-27 00:16:50.592 4321 100123
03F5DF150A3C 2020-07-27 00:17:00.246 4321 100124
How many events came after Timestamp('2020-07-26 23:56:17.279000') (5 minutes before the second event) and belong to the same Group1 and Group2? The second event would therefore have a counter of 1 in this example; the rest would have a counter of 0, since their group combinations are unique.
Each event should have a counter indicating how many events came before it from the same groups.
I tried sorting the data by the groups and by Time, then running a nested loop: the outer one over all events, and the inner one from the start up to the current event's index. After a few thousand rows, the process slows down significantly, rendering this option infeasible.
I was wondering if there is any other elegant and efficient way of doing this.
EDIT: I was able to do this with a single for-loop instead of a nested one. In each iteration I took the Time and the groups, sliced the DataFrame down to the events in the same groups and in the desired time frame, and then counted the events:
from datetime import timedelta
from tqdm import tqdm

for i in tqdm(range(len(df))):
    time_stamp = df.loc[i, 'Time']
    group1 = df.loc[i, 'Group1']
    group2 = df.loc[i, 'Group2']
    # keep only events strictly within the 5 minutes preceding this event
    sub_df = df[df['Time'] > time_stamp - timedelta(minutes=5)]
    sub_df = sub_df[sub_df['Time'] < time_stamp]
    sub_df = sub_df[sub_df['Group1'] == group1]
    sub_df = sub_df[sub_df['Group2'] == group2]
    df.loc[i, 'prior_events'] = len(sub_df)  # len() counts rows; .size would count cells
Still, tqdm shows about 18 iterations per second, which is not great for 100k rows.
Upvotes: 1
Views: 1064
Reputation: 30971
To get a more instructive result, I extended your data sample:
ID Time Group1 Group2
0 61ED2269CCAC 2020-07-27 00:01:05.781 1234 100123
1 61C2DC4E96FA 2020-07-27 00:01:17.279 1234 100123
2 FAD0839C1A95 2020-07-27 00:02:38.112 1234 100124
3 FAD0839C1A95 2020-07-27 00:05:38.000 1234 100123
4 FAD0839C1A95 2020-07-27 00:06:39.000 1234 100123
5 A2750A7B6C24 2020-07-27 00:16:50.592 4321 100123
6 03F5DF150A3C 2020-07-27 00:17:00.246 4321 100124
7 03F5DF150A3C 2020-07-27 00:18:00.000 4321 100124
8 03F5DF150A3C 2020-07-27 00:20:00.000 4321 100124
9 03F5DF150A3C 2020-07-27 00:22:00.000 4321 100124
10 03F5DF150A3C 2020-07-27 00:23:00.000 4321 100124
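For reference, here is one minimal sketch of how this extended sample could be reconstructed as a DataFrame with a proper datetime column (the io.StringIO / read_csv approach is just my own choice, not anything prescribed above):

import io
import pandas as pd

csv = """ID,Time,Group1,Group2
61ED2269CCAC,2020-07-27 00:01:05.781,1234,100123
61C2DC4E96FA,2020-07-27 00:01:17.279,1234,100123
FAD0839C1A95,2020-07-27 00:02:38.112,1234,100124
FAD0839C1A95,2020-07-27 00:05:38.000,1234,100123
FAD0839C1A95,2020-07-27 00:06:39.000,1234,100123
A2750A7B6C24,2020-07-27 00:16:50.592,4321,100123
03F5DF150A3C,2020-07-27 00:17:00.246,4321,100124
03F5DF150A3C,2020-07-27 00:18:00.000,4321,100124
03F5DF150A3C,2020-07-27 00:20:00.000,4321,100124
03F5DF150A3C,2020-07-27 00:22:00.000,4321,100124
03F5DF150A3C,2020-07-27 00:23:00.000,4321,100124"""

# parse_dates makes Time a datetime64 column, as the code below assumes
df = pd.read_csv(io.StringIO(csv), parse_dates=['Time'])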
Assuming that the Time column is of datetime type and its values are unique, you can generate the result (the Count column) as follows:
df.set_index('Time', inplace=True)
# within each (Group1, Group2) group, count events in a 5-minute rolling
# window ending at each row, then subtract 1 to exclude the current event
# (note: newer pandas versions may require window='5min' instead of '5T')
df['Count'] = (df.groupby(['Group1', 'Group2'], as_index=False)\
    .Group1.rolling(window='5T', closed='both').count() - 1).astype(int)\
    .reset_index(level=0, drop=True)
df.reset_index(inplace=True)
The result is:
Time ID Group1 Group2 Count
0 2020-07-27 00:01:05.781 61ED2269CCAC 1234 100123 0
1 2020-07-27 00:01:17.279 61C2DC4E96FA 1234 100123 1
2 2020-07-27 00:02:38.112 FAD0839C1A95 1234 100124 0
3 2020-07-27 00:05:38.000 FAD0839C1A95 1234 100123 2
4 2020-07-27 00:06:39.000 FAD0839C1A95 1234 100123 1
5 2020-07-27 00:16:50.592 A2750A7B6C24 4321 100123 0
6 2020-07-27 00:17:00.246 03F5DF150A3C 4321 100124 0
7 2020-07-27 00:18:00.000 03F5DF150A3C 4321 100124 1
8 2020-07-27 00:20:00.000 03F5DF150A3C 4321 100124 2
9 2020-07-27 00:22:00.000 03F5DF150A3C 4321 100124 3
10 2020-07-27 00:23:00.000 03F5DF150A3C 4321 100124 3
Note the last row. It has Count == 3, which includes the event exactly 5 minutes before it. If you do not want that boundary event to be counted, drop the closed='both' parameter.
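For illustration, the statement without that parameter would look like this (a sketch; it is identical to the one above except for the dropped argument):

# the default closure for time-based windows is 'right', i.e. the interval
# (t - 5min, t], so an event exactly 5 minutes earlier is no longer counted
df['Count'] = (df.groupby(['Group1', 'Group2'], as_index=False)\
    .Group1.rolling(window='5T').count() - 1).astype(int)\
    .reset_index(level=0, drop=True)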
Apparently your data have repeating Time values, even within a group of rows with the same Group1 / Group2.
To cope with this, take another approach:
Define a function generating counts:
import pandas as pd

def Counts(grp):
    # events per distinct Time within this (Group1, Group2) group
    vc = grp.Time.value_counts().sort_index()
    # 5-minute rolling sum minus the current Time's own count
    # (so same-timestamp events, including the row itself, are excluded)
    cnt = (vc.rolling(window='5T', closed='both').sum()).astype(int) - vc
    # map the counts back onto the group's original row index
    s = pd.Series(cnt, index=grp.Time)
    return pd.Series(s.values, index=grp.index)
Apply it:
df['Counts'] = df.groupby(['Group1', 'Group2'], as_index=False)\
    .apply(Counts).reset_index(level=0, drop=True)
This code is based on the assumption that your source DataFrame is ordered by Time.
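If that ordering is not guaranteed, a one-line sort beforehand establishes it (a sketch, assuming a default integer index should be regenerated):

# ensure chronological order before applying the groupby / rolling logic
df = df.sort_values('Time').reset_index(drop=True)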
I tested the above code on the data sample with an added row whose Time repeats that of the previous row.
The result is:
ID Time Group1 Group2 Counts
0 61ED2269CCAC 2020-07-27 00:01:05.781 1234 100123 0
1 61C2DC4E96FA 2020-07-27 00:01:17.279 1234 100123 1
2 FAD0839C1A95 2020-07-27 00:02:38.112 1234 100124 0
3 FAD0839C1A95 2020-07-27 00:05:38.000 1234 100123 2
4 FAD0839C1A95 2020-07-27 00:06:39.000 1234 100123 1
5 A2750A7B6C24 2020-07-27 00:16:50.592 4321 100123 0
6 03F5DF150A3C 2020-07-27 00:17:00.246 4321 100124 0
7 03F5DF150A3C 2020-07-27 00:18:00.000 4321 100124 1
8 03F5DF150A3C 2020-07-27 00:20:00.000 4321 100124 2
9 03F5DF150A3C 2020-07-27 00:22:00.000 4321 100124 3
10 03F5DF150A3C 2020-07-27 00:23:00.000 4321 100124 3
11 03F5DF150BBB 2020-07-27 00:23:00.000 4321 100124 3
Upvotes: 2