Reputation: 843
I have the following tibble (only showing the first 10 rows out of ~2 million):
ID STATUS NUMBER FUNCTION LASTMODIFIED AMOUNT YEAR MONTH DAY
<chr> <chr> <chr> <chr> <dttm> <dbl> <int> <int> <int>
1 oQYKPAsu9j8AAAF APPROVED "008" CREDIT 2022-03-15 15:16:26 2401 2022 3 15
2 hhoKPAs_fjUAAAF APPROVED "101" CREDIT 2022-03-15 15:15:23 959 2022 3 15
3 Ip8KPAsj__4AAAF DENIED "99" LIMIT 2022-03-15 15:14:06 0 2022 3 15
4 wa4KPAstYwIAAAF DENIED "99" LIMIT 2022-03-15 15:13:36 0 2022 3 15
5 GucKPAssdaUAAAF APPROVED "101" LIMIT 2022-03-15 15:13:21 1084 2022 3 15
6 a6AKPAtAsx4AAAF DENIED "101" CREDIT 2022-03-15 15:12:02 699 2022 3 15
7 a6AKPAtAsx4AAAF DENIED "101" CREDIT 2022-03-15 15:12:34 699 2022 3 15
8 F4kKPAss7OAAAAF APPROVED "101" CREDIT 2022-03-15 15:10:25 3167 2022 3 15
9 MK4KPAstiEYAAAF DENIED "99" LIMIT 2022-03-15 15:08:46 0 2022 3 15
10 .nUKPAs.crIAAAF APPROVED NA CREDIT 2022-03-15 15:08:35 58 2022 3 15
This shows some operations that different users have done on a website, each ID is one unique customer. I want to remove duplicate entries that have taken place within x minutes of each other. So clearly only row 6 or 7 in the data above should be kept (preferably the first). Is there a neat tidyverse/dplyr way of doing this?
My first thought was to ignore the LASTMODIFIED
column and use dg &>& filter(!duplicate())
but this will not do what I want.
Upvotes: 0
Views: 41
Reputation: 161155
Assuming the data is already ordered by LASTMODIFIED
(at least within each group), then
xseconds <- 600
dat %>%
group_by(across(-LASTMODIFIED)) %>%
filter(c(TRUE, as.numeric(diff(LASTMODIFIED), units="secs") > xseconds)) %>%
ungroup()
# # A tibble: 9 x 9
# ID STATUS NUMBER FUNCTION LASTMODIFIED AMOUNT YEAR MONTH DAY
# <chr> <chr> <int> <chr> <dttm> <int> <int> <int> <int>
# 1 oQYKPAsu9j8AAAF APPROVED 8 CREDIT 2022-03-15 15:16:26 2401 2022 3 15
# 2 hhoKPAs_fjUAAAF APPROVED 101 CREDIT 2022-03-15 15:15:23 959 2022 3 15
# 3 Ip8KPAsj__4AAAF DENIED 99 LIMIT 2022-03-15 15:14:06 0 2022 3 15
# 4 wa4KPAstYwIAAAF DENIED 99 LIMIT 2022-03-15 15:13:36 0 2022 3 15
# 5 GucKPAssdaUAAAF APPROVED 101 LIMIT 2022-03-15 15:13:21 1084 2022 3 15
# 6 a6AKPAtAsx4AAAF DENIED 101 CREDIT 2022-03-15 15:12:02 699 2022 3 15
# 7 F4kKPAss7OAAAAF APPROVED 101 CREDIT 2022-03-15 15:10:25 3167 2022 3 15
# 8 MK4KPAstiEYAAAF DENIED 99 LIMIT 2022-03-15 15:08:46 0 2022 3 15
# 9 .nUKPAs.crIAAAF APPROVED NA CREDIT 2022-03-15 15:08:35 58 2022 3 15
Data
dat <- structure(list(ID = c("oQYKPAsu9j8AAAF", "hhoKPAs_fjUAAAF", "Ip8KPAsj__4AAAF", "wa4KPAstYwIAAAF", "GucKPAssdaUAAAF", "a6AKPAtAsx4AAAF", "a6AKPAtAsx4AAAF", "F4kKPAss7OAAAAF", "MK4KPAstiEYAAAF", ".nUKPAs.crIAAAF"), STATUS = c("APPROVED", "APPROVED", "DENIED", "DENIED", "APPROVED", "DENIED", "DENIED", "APPROVED", "DENIED", "APPROVED"), NUMBER = c(8L, 101L, 99L, 99L, 101L, 101L, 101L, 101L, 99L, NA), FUNCTION = c("CREDIT", "CREDIT", "LIMIT", "LIMIT", "LIMIT", "CREDIT", "CREDIT", "CREDIT", "LIMIT", "CREDIT" ), LASTMODIFIED = structure(c(1647371786, 1647371723, 1647371646, 1647371616, 1647371601, 1647371522, 1647371554, 1647371425, 1647371326, 1647371315), class = c("POSIXct", "POSIXt"), tzone = ""), AMOUNT = c(2401L, 959L, 0L, 0L, 1084L, 699L, 699L, 3167L, 0L, 58L), YEAR = c(2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L), MONTH = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), DAY = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
Upvotes: 2