Tengku Hanis
Tengku Hanis

Reputation: 87

Combine data by several condition in R

I want to merge two data according to two conditions:

  1. by same ID (only ID in the first data is retained)
  2. if date_mid (from dat2) is in between date_begin and date_end (both from dat1), paste the result (from dat2), if not, noted as "NA"

Also, I want to drop the rows if the ID in the combine data already has the result (either as healthy or sick). In the example below I want to drop the 3rd and 12th rows.

First data (dat1):

dat1 <- tibble(ID = c(paste0(rep("A"), 1:10), "A2", "A10"), 
           date_begin = seq(as.Date("2020/1/1"), by = "month", length.out = 12), 
           date_end = date_begin + 365)
dat1

# A tibble: 12 x 3
   ID    date_begin date_end  
   <chr> <date>     <date>    
 1 A1    2020-01-01 2020-12-31
 2 A2    2020-02-01 2021-01-31
 3 A3    2020-03-01 2021-03-01
 4 A4    2020-04-01 2021-04-01
 5 A5    2020-05-01 2021-05-01
 6 A6    2020-06-01 2021-06-01
 7 A7    2020-07-01 2021-07-01
 8 A8    2020-08-01 2021-08-01
 9 A9    2020-09-01 2021-09-01
10 A10   2020-10-01 2021-10-01
11 A2    2020-11-01 2021-11-01
12 A10   2020-12-01 2021-12-01

Second data (dat2):

dat2 <- tibble(ID = c(paste0(rep("A"), 1:4), paste0(rep("A"), 9:15), "A2"), 
               date_mid = seq(as.Date("2020/1/1"), by = "month", length.out = 12) + 100, 
               result = rep(c("healthy", "sick"), length = 12))
dat2

# A tibble: 12 x 3
   ID    date_mid   result 
   <chr> <date>     <chr>  
 1 A1    2020-04-10 healthy
 2 A2    2020-05-11 sick   
 3 A3    2020-06-09 healthy
 4 A4    2020-07-10 sick   
 5 A9    2020-08-09 healthy
 6 A10   2020-09-09 sick   
 7 A11   2020-10-09 healthy
 8 A12   2020-11-09 sick   
 9 A13   2020-12-10 healthy
10 A14   2021-01-09 sick   
11 A15   2021-02-09 healthy
12 A2    2021-03-11 sick 

I have tried left_join as below:

left_join(dat1, dat2, by = "ID") %>% 
  mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA))

# A tibble: 14 x 5
   ID    date_begin date_end   date_mid   result 
   <chr> <date>     <date>     <date>     <chr>  
 1 A1    2020-01-01 2020-12-31 2020-04-10 healthy
 2 A2    2020-02-01 2021-01-31 2020-05-11 sick   
 3 A2    2020-02-01 2021-01-31 2021-03-11 NA     
 4 A3    2020-03-01 2021-03-01 2020-06-09 healthy
 5 A4    2020-04-01 2021-04-01 2020-07-10 sick   
 6 A5    2020-05-01 2021-05-01 NA         NA     
 7 A6    2020-06-01 2021-06-01 NA         NA     
 8 A7    2020-07-01 2021-07-01 NA         NA     
 9 A8    2020-08-01 2021-08-01 NA         NA     
10 A9    2020-09-01 2021-09-01 2020-08-09 NA     
11 A10   2020-10-01 2021-10-01 2020-09-09 NA     
12 A2    2020-11-01 2021-11-01 2020-05-11 NA     
13 A2    2020-11-01 2021-11-01 2021-03-11 sick   
14 A10   2020-12-01 2021-12-01 2020-09-09 NA  

As I mentioned, I want to drop the 3rd and 12th rows of ID A2, since A2 already have a result of either healthy or sick in 2nd and 13th rows.

The exact result that I want is something like this (only 2 rows of A2):

# A tibble: 12 x 5
   ID    date_begin date_end   date_mid   result 
   <chr> <date>     <date>     <date>     <chr>  
 1 A1    2020-01-01 2020-12-31 2020-04-10 healthy
 2 A2    2020-02-01 2021-01-31 2020-05-11 sick   
 3 A3    2020-03-01 2021-03-01 2020-06-09 healthy
 4 A4    2020-04-01 2021-04-01 2020-07-10 sick   
 5 A5    2020-05-01 2021-05-01 NA         NA     
 6 A6    2020-06-01 2021-06-01 NA         NA     
 7 A7    2020-07-01 2021-07-01 NA         NA     
 8 A8    2020-08-01 2021-08-01 NA         NA     
 9 A9    2020-09-01 2021-09-01 2020-08-09 NA     
10 A10   2020-10-01 2021-10-01 2020-09-09 NA     
11 A2    2020-11-01 2021-11-01 2021-03-11 sick   
12 A10   2020-12-01 2021-12-01 2020-09-09 NA  

Any pointer is appreciated, thanks.

Upvotes: 0

Views: 56

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389235

If there is more than one row for an ID in the result after joining keep only the non-NA rows. This can be written in dplyr as -

library(dplyr)
library(lubridate)

left_join(dat1, dat2, by = "ID") %>% 
  mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA)) %>%
  group_by(ID, date_begin, date_end)  %>%
  filter(if(n() > 1) !is.na(result) else TRUE) %>%
  ungroup


#   ID    date_begin date_end   date_mid   result 
#   <chr> <date>     <date>     <date>     <chr>  
# 1 A1    2020-01-01 2020-12-31 2020-04-10 healthy
# 2 A2    2020-02-01 2021-01-31 2020-05-11 sick   
# 3 A3    2020-03-01 2021-03-01 2020-06-09 healthy
# 4 A4    2020-04-01 2021-04-01 2020-07-10 sick   
# 5 A5    2020-05-01 2021-05-01 NA         NA     
# 6 A6    2020-06-01 2021-06-01 NA         NA     
# 7 A7    2020-07-01 2021-07-01 NA         NA     
# 8 A8    2020-08-01 2021-08-01 NA         NA     
# 9 A9    2020-09-01 2021-09-01 2020-08-09 NA     
#10 A10   2020-10-01 2021-10-01 2020-09-09 NA     
#11 A2    2020-11-01 2021-11-01 2021-03-11 sick   
#12 A10   2020-12-01 2021-12-01 2020-09-09 NA     

Upvotes: 1

Related Questions