Reputation: 13088
I have a dataframe with tens of millions of rows:
| userId | pageId | bannerId | timestamp           |
|--------|--------|----------|---------------------|
| A | P1 | B1 | 2020-10-10 01:00:00 |
| A | P1 | B1 | 2020-10-10 01:00:10 |
| B | P1 | B1 | 2020-10-10 01:00:00 |
| B | P2 | B2 | 2020-10-10 02:00:00 |
What I'd like to do is remove all rows where, for the same userId, pageId, bannerId combination, the timestamp is within n minutes of the previous occurrence of that same combination.
What I'm doing now:
import pandas as pd

# Get all `userId, pageId, bannerId` combinations that repeat,
# although not all of them will have repeated within the `n` minute
# threshold I'm interested in.
groups = df.groupby(['userId', 'pageId', 'bannerId']).userId.count()
# Iterate through each group, and manually check if the repetition was
# within `n` minutes. Keep track of all IDs to be removed.
to_remove = []
for user_id, page_id, banner_id in groups.index:
    sub = df.loc[
        (df.userId == user_id) &
        (df.pageId == page_id) &
        (df.bannerId == banner_id)
    ].sort_values('timestamp')
    # Now that each occurrence is listed chronologically,
    # check time diff.
    sub = sub.loc[
        ((sub.timestamp.shift(1) - sub.timestamp) / pd.Timedelta(minutes=1)).abs() <= n
    ]
    if sub.shape[0] > 0:
        to_remove += sub.index.tolist()
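The indexes collected in to_remove are then dropped from the DataFrame; a minimal sketch of that final step (not shown in the question, assuming the original index is intact):

df = df.drop(index=to_remove)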
This does work as I'd like. The only issue is that, with the large amount of data I have, it takes hours to complete.
Upvotes: 0
Views: 59
Reputation: 30991
To get a more instructive result, I used a slightly longer source DataFrame:
userId pageId bannerId timestamp
0 A P1 B1 2020-10-10 01:00:00
1 A P1 B1 2020-10-10 01:04:10
2 A P1 B1 2020-10-10 01:05:00
3 A P1 B1 2020-10-10 01:08:20
4 A P1 B1 2020-10-10 01:09:30
5 A P1 B1 2020-10-10 01:11:00
6 B P1 B1 2020-10-10 01:00:00
7 B P2 B2 2020-10-10 02:00:00
Note: the timestamp column is of datetime type.
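If the column still holds plain strings after loading, it can be converted to datetime first; a one-line sketch (not part of the original answer):

df['timestamp'] = pd.to_datetime(df['timestamp'])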
Start by defining a "filtering" function for a group of timestamp values (for some combination of userId, pageId and bannerId):
import numpy as np
import pandas as pd

def myFilter(grp, nMin):
    # Keep a timestamp only if it is at least nMin minutes
    # after the last timestamp that was *kept*.
    prevTs = np.nan
    grp = grp.sort_values()
    res = []
    for ts in grp:
        if pd.isna(prevTs) or (ts - prevTs) / pd.Timedelta(1, 'm') >= nMin:
            prevTs = ts
            res.append(ts)
    return res
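As a quick standalone check (a hypothetical snippet, not from the original answer), the function can be applied to a bare Series of timestamps:

ts = pd.Series(pd.to_datetime([
    '2020-10-10 01:00:00', '2020-10-10 01:04:10', '2020-10-10 01:05:00',
]))
print(myFilter(ts, 5))
# [Timestamp('2020-10-10 01:00:00'), Timestamp('2020-10-10 01:05:00')]

The 01:04:10 entry is dropped because it is only about 4 minutes after the last kept timestamp, while 01:05:00 is exactly 5 minutes after it and so is kept.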
Then set the time threshold (the number of minutes):
nMin = 5
And the last thing is to generate the result:
result = df.groupby(['userId', 'pageId', 'bannerId'])\
    .timestamp.apply(myFilter, nMin).explode().reset_index()
For my data sample, the result is:
userId pageId bannerId timestamp
0 A P1 B1 2020-10-10 01:00:00
1 A P1 B1 2020-10-10 01:05:00
2 A P1 B1 2020-10-10 01:11:00
3 B P1 B1 2020-10-10 01:00:00
4 B P2 B2 2020-10-10 02:00:00
Note that an "ordinary" diff is not enough because, e.g., starting from the row with timestamp 01:05:00, the two following rows (01:08:20 and 01:09:30) should be dropped, as they are within the 5-minute limit from 01:05:00.
So it is not enough to look at the previous row only. Starting from some row, you should "mark for drop" all following rows until you find a row whose timestamp is at least as distant from the "start row" as the limit. That row then becomes the starting row for the analysis of the following rows (within the current group).
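To see the difference on the sample data (a hypothetical comparison, assuming df and myFilter are defined as above):

# Group A timestamps from the sample above.
grpA = df.loc[df.userId == 'A', 'timestamp'].sort_values()

# One-step diff: each row is compared with its immediate predecessor,
# even when that predecessor was itself dropped, so it discards too much.
naive = grpA[grpA.diff().isna() | (grpA.diff() >= pd.Timedelta(minutes=5))]
print(naive.dt.strftime('%H:%M:%S').tolist())
# ['01:00:00']

# Stateful filter: compares each row with the last *kept* row.
print([t.strftime('%H:%M:%S') for t in myFilter(grpA, 5)])
# ['01:00:00', '01:05:00', '01:11:00']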
Upvotes: 1