How can I use mutate to create a new column based only on a subset of other rows of a data frame?

Question

I was agonizing over how to phrase my question. I have a data frame of accounts and I want to create a new column that is a flag for whether there is another account that has a duplicate email within 30 days of that account.

I have a table like this.

AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")

Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
       "2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")

df <- data.frame(AccountNumbers,EmailAddress,Dates)

print(df)

AccountNumbers     EmailAddress      Dates
3748           John@gmail.com    2018-05-01
8894           John@gmail.com    2018-05-05
9923           Alex@outlook.com  2018-05-10
4502           Alan@yahoo.com    2018-05-15
7283           Stan@aol.com      2018-05-20
8012           Mary@outlook.com  2018-05-25
2938           Adam@outlook.com  2018-05-30
7485           Tom@aol.com       2018-06-01
1010           Jane@yahoo.com    2018-06-05
9877           John@gmail.com    2018-06-10

John@gmail.com appears three times, I want to flag the first two rows because they both appear within 30 days of each other, but I don't want to flag the third.

AccountNumbers     EmailAddress      Dates        DuplicateEmailFlag
3748           John@gmail.com    2018-05-01                  1
8894           John@gmail.com    2018-05-05                  1
9923           Alex@outlook.com  2018-05-10                  0
4502           Alan@yahoo.com    2018-05-15                  0
7283           Stan@aol.com      2018-05-20                  0
8012           Mary@outlook.com  2018-05-25                  0
2938           Adam@outlook.com  2018-05-30                  0
7485           Tom@aol.com       2018-06-01                  0
1010           Jane@yahoo.com    2018-06-05                  0
9877           John@gmail.com    2018-06-10                  0

I've been trying to use an ifelse() inside of mutate, but I don't know if it's possible to tell dplyr to only consider rows that are within 30 days of the row being considered.

Edit: To clarify, I want to look at the 30 days around each account. So that if I had a scenario where the same email address was being added exactly every 30 days, all of the occurrences of that email should be flagged.

Dan · Accepted Answer

This seems to work. First, I define the data frame.

AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")

Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
           "2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")

df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))

Next, I group by email and check if there's an entry in the preceding or following 30 days. I also replace NAs (corresponding to cases with only one entry) with 0. Finally, I ungroup.

df %>% 
  group_by(email) %>% 
  mutate(dupe = coalesce(date - lag(date) < 30, (date - lead(date) < 30))) %>% 
  mutate(dupe = replace_na(dupe, 0)) %>% 
  ungroup

This gives,

# # A tibble: 10 x 4
#    number email            date        dupe
#                       
#  1   3748 John@gmail.com   2018-05-01     1
#  2   8894 John@gmail.com   2018-05-05     1
#  3   9923 Alex@outlook.com 2018-05-10     0
#  4   4502 Alan@yahoo.com   2018-05-15     0
#  5   7283 Stan@aol.com     2018-05-20     0
#  6   8012 Mary@outlook.com 2018-05-25     0
#  7   2938 Adam@outlook.com 2018-05-30     0
#  8   7485 Tom@aol.com      2018-06-01     0
#  9   1010 Jane@yahoo.com   2018-06-05     0
# 10   9877 John@gmail.com   2018-06-10     0

as required.

Edit: This makes the implicit assumption that your data are sorted by date. If not, you'd need to add an extra step to do so.

How can I use mutate to create a new column based only on a subset of other rows of a data frame?

Answers (2)

Note:

slightly modified data

Related Questions