Reputation: 87
I want to merge two data according to two conditions:
Also, I want to drop the rows if the ID in the combine data already has the result (either as healthy or sick). In the example below I want to drop the 3rd and 12th rows.
First data (dat1):
dat1 <- tibble(ID = c(paste0(rep("A"), 1:10), "A2", "A10"),
date_begin = seq(as.Date("2020/1/1"), by = "month", length.out = 12),
date_end = date_begin + 365)
dat1
# A tibble: 12 x 3
ID date_begin date_end
<chr> <date> <date>
1 A1 2020-01-01 2020-12-31
2 A2 2020-02-01 2021-01-31
3 A3 2020-03-01 2021-03-01
4 A4 2020-04-01 2021-04-01
5 A5 2020-05-01 2021-05-01
6 A6 2020-06-01 2021-06-01
7 A7 2020-07-01 2021-07-01
8 A8 2020-08-01 2021-08-01
9 A9 2020-09-01 2021-09-01
10 A10 2020-10-01 2021-10-01
11 A2 2020-11-01 2021-11-01
12 A10 2020-12-01 2021-12-01
Second data (dat2):
dat2 <- tibble(ID = c(paste0(rep("A"), 1:4), paste0(rep("A"), 9:15), "A2"),
date_mid = seq(as.Date("2020/1/1"), by = "month", length.out = 12) + 100,
result = rep(c("healthy", "sick"), length = 12))
dat2
# A tibble: 12 x 3
ID date_mid result
<chr> <date> <chr>
1 A1 2020-04-10 healthy
2 A2 2020-05-11 sick
3 A3 2020-06-09 healthy
4 A4 2020-07-10 sick
5 A9 2020-08-09 healthy
6 A10 2020-09-09 sick
7 A11 2020-10-09 healthy
8 A12 2020-11-09 sick
9 A13 2020-12-10 healthy
10 A14 2021-01-09 sick
11 A15 2021-02-09 healthy
12 A2 2021-03-11 sick
I have tried left_join
as below:
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA))
# A tibble: 14 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A2 2020-02-01 2021-01-31 2021-03-11 NA
4 A3 2020-03-01 2021-03-01 2020-06-09 healthy
5 A4 2020-04-01 2021-04-01 2020-07-10 sick
6 A5 2020-05-01 2021-05-01 NA NA
7 A6 2020-06-01 2021-06-01 NA NA
8 A7 2020-07-01 2021-07-01 NA NA
9 A8 2020-08-01 2021-08-01 NA NA
10 A9 2020-09-01 2021-09-01 2020-08-09 NA
11 A10 2020-10-01 2021-10-01 2020-09-09 NA
12 A2 2020-11-01 2021-11-01 2020-05-11 NA
13 A2 2020-11-01 2021-11-01 2021-03-11 sick
14 A10 2020-12-01 2021-12-01 2020-09-09 NA
As I mentioned, I want to drop the 3rd and 12th rows of ID A2, since A2 already have a result of either healthy or sick in 2nd and 13th rows.
The exact result that I want is something like this (only 2 rows of A2):
# A tibble: 12 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
4 A4 2020-04-01 2021-04-01 2020-07-10 sick
5 A5 2020-05-01 2021-05-01 NA NA
6 A6 2020-06-01 2021-06-01 NA NA
7 A7 2020-07-01 2021-07-01 NA NA
8 A8 2020-08-01 2021-08-01 NA NA
9 A9 2020-09-01 2021-09-01 2020-08-09 NA
10 A10 2020-10-01 2021-10-01 2020-09-09 NA
11 A2 2020-11-01 2021-11-01 2021-03-11 sick
12 A10 2020-12-01 2021-12-01 2020-09-09 NA
Any pointer is appreciated, thanks.
Upvotes: 0
Views: 56
Reputation: 389235
If there is more than one row for an ID
in the result after joining keep only the non-NA rows. This can be written in dplyr
as -
library(dplyr)
library(lubridate)
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA)) %>%
group_by(ID, date_begin, date_end) %>%
filter(if(n() > 1) !is.na(result) else TRUE) %>%
ungroup
# ID date_begin date_end date_mid result
# <chr> <date> <date> <date> <chr>
# 1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
# 2 A2 2020-02-01 2021-01-31 2020-05-11 sick
# 3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
# 4 A4 2020-04-01 2021-04-01 2020-07-10 sick
# 5 A5 2020-05-01 2021-05-01 NA NA
# 6 A6 2020-06-01 2021-06-01 NA NA
# 7 A7 2020-07-01 2021-07-01 NA NA
# 8 A8 2020-08-01 2021-08-01 NA NA
# 9 A9 2020-09-01 2021-09-01 2020-08-09 NA
#10 A10 2020-10-01 2021-10-01 2020-09-09 NA
#11 A2 2020-11-01 2021-11-01 2021-03-11 sick
#12 A10 2020-12-01 2021-12-01 2020-09-09 NA
Upvotes: 1