Reputation: 3
I was agonizing over how to phrase my question. I have a data frame of accounts and I want to create a new column that is a flag for whether there is another account that has a duplicate email within 30 days of that account.
I have a table like this.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(AccountNumbers,EmailAddress,Dates)
print(df)
AccountNumbers EmailAddress Dates
3748 [email protected] 2018-05-01
8894 [email protected] 2018-05-05
9923 [email protected] 2018-05-10
4502 [email protected] 2018-05-15
7283 [email protected] 2018-05-20
8012 [email protected] 2018-05-25
2938 [email protected] 2018-05-30
7485 [email protected] 2018-06-01
1010 [email protected] 2018-06-05
9877 [email protected] 2018-06-10
[email protected] appears three times, I want to flag the first two rows because they both appear within 30 days of each other, but I don't want to flag the third.
AccountNumbers EmailAddress Dates DuplicateEmailFlag
3748 [email protected] 2018-05-01 1
8894 [email protected] 2018-05-05 1
9923 [email protected] 2018-05-10 0
4502 [email protected] 2018-05-15 0
7283 [email protected] 2018-05-20 0
8012 [email protected] 2018-05-25 0
2938 [email protected] 2018-05-30 0
7485 [email protected] 2018-06-01 0
1010 [email protected] 2018-06-05 0
9877 [email protected] 2018-06-10 0
I've been trying to use an ifelse() inside of mutate, but I don't know if it's possible to tell dplyr to only consider rows that are within 30 days of the row being considered.
Edit: To clarify, I want to look at the 30 days around each account. So that if I had a scenario where the same email address was being added exactly every 30 days, all of the occurrences of that email should be flagged.
Upvotes: 0
Views: 151
Reputation: 5405
I think this gets at what you want:
df %>%
group_by(EmailAddress) %>%
mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), 'days') <= 30, 0, 1), 0))) %>%
group_by(EmailAddress, helper) %>%
mutate(DuplicateEmailFlag = (n() >= 2)*1) %>%
ungroup() %>%
select(-helper)
# A tibble: 10 x 4
AccountNumbers EmailAddress Dates DuplicateEmailFlag
<dbl> <chr> <date> <dbl>
1 3748 [email protected] 2018-05-01 1
2 8894 [email protected] 2018-05-05 1
3 9923 [email protected] 2018-05-10 0
4 4502 [email protected] 2018-05-15 0
5 7283 [email protected] 2018-05-20 0
6 8012 [email protected] 2018-05-25 0
7 2938 [email protected] 2018-05-30 0
8 7485 [email protected] 2018-06-01 0
9 1010 [email protected] 2018-06-05 0
10 9877 [email protected] 2018-06-10 0
I think @Lyngbakr's solution is better for the circumstances in your question. Mine would be more appropriate if the size of the duplicate group might change (e.g., you want to check for 3 or 4 entries within 30 days of each other, rather than 2).
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]")
Dates <- as.Date(c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10"))
df <- data.frame(AccountNumbers,EmailAddress,Dates, stringsAsFactors = FALSE)
Upvotes: 0
Reputation: 12074
This seems to work. First, I define the data frame.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))
Next, I group by email and check if there's an entry in the preceding or following 30 days. I also replace NA
s (corresponding to cases with only one entry) with 0. Finally, I ungroup.
df %>%
group_by(email) %>%
mutate(dupe = coalesce(date - lag(date) < 30, (date - lead(date) < 30))) %>%
mutate(dupe = replace_na(dupe, 0)) %>%
ungroup
This gives,
# # A tibble: 10 x 4
# number email date dupe
# <dbl> <fct> <date> <dbl>
# 1 3748 [email protected] 2018-05-01 1
# 2 8894 [email protected] 2018-05-05 1
# 3 9923 [email protected] 2018-05-10 0
# 4 4502 [email protected] 2018-05-15 0
# 5 7283 [email protected] 2018-05-20 0
# 6 8012 [email protected] 2018-05-25 0
# 7 2938 [email protected] 2018-05-30 0
# 8 7485 [email protected] 2018-06-01 0
# 9 1010 [email protected] 2018-06-05 0
# 10 9877 [email protected] 2018-06-10 0
as required.
Edit: This makes the implicit assumption that your data are sorted by date. If not, you'd need to add an extra step to do so.
Upvotes: 1