Parseval
Parseval

Reputation: 843

Remove duplicates while imposing special conditions in R

I have the following tibble (only showing the first 10 rows out of ~2 million):

   ID              STATUS     NUMBER  FUNCTION LASTMODIFIED         AMOUNT YEAR   MONTH  DAY  
   <chr>           <chr>      <chr>   <chr>    <dttm>               <dbl>  <int>  <int>  <int>
 1 oQYKPAsu9j8AAAF APPROVED   "008"   CREDIT   2022-03-15 15:16:26  2401   2022   3      15   
 2 hhoKPAs_fjUAAAF APPROVED   "101"   CREDIT   2022-03-15 15:15:23  959    2022   3      15   
 3 Ip8KPAsj__4AAAF DENIED     "99"    LIMIT    2022-03-15 15:14:06  0      2022   3      15   
 4 wa4KPAstYwIAAAF DENIED     "99"    LIMIT    2022-03-15 15:13:36  0      2022   3      15   
 5 GucKPAssdaUAAAF APPROVED   "101"   LIMIT    2022-03-15 15:13:21  1084   2022   3      15   
 6 a6AKPAtAsx4AAAF DENIED     "101"   CREDIT   2022-03-15 15:12:02  699    2022   3      15   
 7 a6AKPAtAsx4AAAF DENIED     "101"   CREDIT   2022-03-15 15:12:34  699    2022   3      15   
 8 F4kKPAss7OAAAAF APPROVED   "101"   CREDIT   2022-03-15 15:10:25  3167   2022   3      15   
 9 MK4KPAstiEYAAAF DENIED     "99"    LIMIT    2022-03-15 15:08:46  0      2022   3      15   
10 .nUKPAs.crIAAAF APPROVED    NA     CREDIT   2022-03-15 15:08:35  58     2022   3      15   

This shows some operations that different users have done on a website, each ID is one unique customer. I want to remove duplicate entries that have taken place within x minutes of each other. So clearly only row 6 or 7 in the data above should be kept (preferably the first). Is there a neat tidyverse/dplyr way of doing this?

My first thought was to ignore the LASTMODIFIED column and use dg &>& filter(!duplicate()) but this will not do what I want.

Upvotes: 0

Views: 41

Answers (1)

r2evans
r2evans

Reputation: 161155

Assuming the data is already ordered by LASTMODIFIED (at least within each group), then

xseconds <- 600
dat %>%
  group_by(across(-LASTMODIFIED)) %>%
  filter(c(TRUE, as.numeric(diff(LASTMODIFIED), units="secs") > xseconds)) %>%
  ungroup()
# # A tibble: 9 x 9
#   ID              STATUS   NUMBER FUNCTION LASTMODIFIED        AMOUNT  YEAR MONTH   DAY
#   <chr>           <chr>     <int> <chr>    <dttm>               <int> <int> <int> <int>
# 1 oQYKPAsu9j8AAAF APPROVED      8 CREDIT   2022-03-15 15:16:26   2401  2022     3    15
# 2 hhoKPAs_fjUAAAF APPROVED    101 CREDIT   2022-03-15 15:15:23    959  2022     3    15
# 3 Ip8KPAsj__4AAAF DENIED       99 LIMIT    2022-03-15 15:14:06      0  2022     3    15
# 4 wa4KPAstYwIAAAF DENIED       99 LIMIT    2022-03-15 15:13:36      0  2022     3    15
# 5 GucKPAssdaUAAAF APPROVED    101 LIMIT    2022-03-15 15:13:21   1084  2022     3    15
# 6 a6AKPAtAsx4AAAF DENIED      101 CREDIT   2022-03-15 15:12:02    699  2022     3    15
# 7 F4kKPAss7OAAAAF APPROVED    101 CREDIT   2022-03-15 15:10:25   3167  2022     3    15
# 8 MK4KPAstiEYAAAF DENIED       99 LIMIT    2022-03-15 15:08:46      0  2022     3    15
# 9 .nUKPAs.crIAAAF APPROVED     NA CREDIT   2022-03-15 15:08:35     58  2022     3    15

Data

dat <- structure(list(ID = c("oQYKPAsu9j8AAAF", "hhoKPAs_fjUAAAF", "Ip8KPAsj__4AAAF", "wa4KPAstYwIAAAF", "GucKPAssdaUAAAF", "a6AKPAtAsx4AAAF", "a6AKPAtAsx4AAAF", "F4kKPAss7OAAAAF", "MK4KPAstiEYAAAF", ".nUKPAs.crIAAAF"), STATUS = c("APPROVED", "APPROVED", "DENIED", "DENIED", "APPROVED", "DENIED", "DENIED", "APPROVED", "DENIED", "APPROVED"), NUMBER = c(8L, 101L, 99L, 99L, 101L, 101L, 101L, 101L, 99L, NA), FUNCTION = c("CREDIT", "CREDIT", "LIMIT", "LIMIT", "LIMIT", "CREDIT", "CREDIT", "CREDIT", "LIMIT", "CREDIT" ), LASTMODIFIED = structure(c(1647371786, 1647371723, 1647371646, 1647371616, 1647371601, 1647371522, 1647371554, 1647371425, 1647371326, 1647371315), class = c("POSIXct", "POSIXt"), tzone = ""), AMOUNT = c(2401L, 959L, 0L, 0L, 1084L, 699L, 699L, 3167L, 0L, 58L), YEAR = c(2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L), MONTH = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), DAY = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c("1", "2",  "3", "4", "5", "6", "7", "8", "9", "10"))

Upvotes: 2

Related Questions