Reputation: 13088
I have a dataframe with tens of millions of rows:
| userId | pageId | bannerId | timestamp           |
|--------|--------|----------|---------------------|
| A | P1 | B1 | 2020-10-10 01:00:00 |
| A | P1 | B1 | 2020-10-10 01:00:10 |
| B | P1 | B1 | 2020-10-10 01:00:00 |
| B | P2 | B2 | 2020-10-10 02:00:00 |
What I'd like to do is remove all rows where, for the same userId, pageId, bannerId combination, the timestamp is within n minutes of the previous occurrence of that same combination.
What I'm doing now:
import pandas as pd

# Get all `userId, pageId, bannerId` combinations that repeat,
# although not all of them will have repeated within the `n` minute
# threshold I'm interested in.
groups = df.groupby(['userId', 'pageId', 'bannerId']).userId.count()
# Iterate through each group, and manually check if the repetition was
# within `n` minutes. Keep track of all IDs to be removed.
to_remove = []
for user_id, page_id, banner_id in groups.index:
    sub = df.loc[
        (df.userId == user_id) &
        (df.pageId == page_id) &
        (df.bannerId == banner_id)
    ].sort_values('timestamp')
    # Now that each occurrence is listed chronologically,
    # check time diff.
    sub = sub.loc[
        ((sub.timestamp.shift(1) - sub.timestamp) / pd.Timedelta(minutes=1)).abs() <= n
    ]
    if sub.shape[0] > 0:
        to_remove += sub.index.tolist()
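The indexes collected in to_remove are then dropped from the DataFrame; a minimal sketch of that final step (not shown in the question, assuming the original index is intact):

df = df.drop(index=to_remove)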
This does work as I'd like. The only issue is that, with the large amount of data I have, it takes hours to complete.
Upvotes: 0
Views: 59
Reputation: 30991
To get a more instructive result, I used a slightly longer source DataFrame:
userId pageId bannerId timestamp
0 A P1 B1 2020-10-10 01:00:00
1 A P1 B1 2020-10-10 01:04:10
2 A P1 B1 2020-10-10 01:05:00
3 A P1 B1 2020-10-10 01:08:20
4 A P1 B1 2020-10-10 01:09:30
5 A P1 B1 2020-10-10 01:11:00
6 B P1 B1 2020-10-10 01:00:00
7 B P2 B2 2020-10-10 02:00:00
Note: the timestamp column is of datetime type.
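If the column still holds plain strings after loading, it can be converted to datetime first; a one-line sketch (not part of the original answer):

df['timestamp'] = pd.to_datetime(df['timestamp'])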
Start by defining a "filtering" function for a group of timestamp values (for some combination of userId, pageId and bannerId):
import numpy as np
import pandas as pd

def myFilter(grp, nMin):
    # Keep a timestamp only if it is at least nMin minutes
    # after the last timestamp that was *kept*.
    prevTs = np.nan
    grp = grp.sort_values()
    res = []
    for ts in grp:
        if pd.isna(prevTs) or (ts - prevTs) / pd.Timedelta(1, 'm') >= nMin:
            prevTs = ts
            res.append(ts)
    return res
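As a quick standalone check (a hypothetical snippet, not from the original answer), the function can be applied to a bare Series of timestamps:

ts = pd.Series(pd.to_datetime([
    '2020-10-10 01:00:00', '2020-10-10 01:04:10', '2020-10-10 01:05:00',
]))
print(myFilter(ts, 5))
# [Timestamp('2020-10-10 01:00:00'), Timestamp('2020-10-10 01:05:00')]

The 01:04:10 entry is dropped because it is only about 4 minutes after the last kept timestamp, while 01:05:00 is exactly 5 minutes after it and so is kept.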
Then set the time threshold (the number of minutes):
nMin = 5
And the last thing is to generate the result:
result = df.groupby(['userId', 'pageId', 'bannerId'])\
    .timestamp.apply(myFilter, nMin).explode().reset_index()
For my data sample, the result is:
userId pageId bannerId timestamp
0 A P1 B1 2020-10-10 01:00:00
1 A P1 B1 2020-10-10 01:05:00
2 A P1 B1 2020-10-10 01:11:00
3 B P1 B1 2020-10-10 01:00:00
4 B P2 B2 2020-10-10 02:00:00
Note that an "ordinary" diff is not enough because, e.g., starting from the row with timestamp 01:05:00, the two following rows (01:08:20 and 01:09:30) should be dropped, as they are within the 5-minute limit from 01:05:00.
So it is not enough to look at the previous row only. Starting from some row, you should "mark for drop" all following rows until you find a row whose timestamp is at least as distant from the "start row" as the limit. That row then becomes the starting row for the analysis of the following rows (within the current group).
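To see the difference on the sample data (a hypothetical comparison, assuming df and myFilter are defined as above):

# Group A timestamps from the sample above.
grpA = df.loc[df.userId == 'A', 'timestamp'].sort_values()

# One-step diff: each row is compared with its immediate predecessor,
# even when that predecessor was itself dropped, so it discards too much.
naive = grpA[grpA.diff().isna() | (grpA.diff() >= pd.Timedelta(minutes=5))]
print(naive.dt.strftime('%H:%M:%S').tolist())
# ['01:00:00']

# Stateful filter: compares each row with the last *kept* row.
print([t.strftime('%H:%M:%S') for t in myFilter(grpA, 5)])
# ['01:00:00', '01:05:00', '01:11:00']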
Upvotes: 1