How to join dataframes conditionally on column pairs?

Question

I have two df

d1 <- data_frame(name=c('Mike','Carl','Joe', 'Mike'), 
           id=c(1,2,NA,1), 
           id2=c('a',NA,'c','a'),
           dateT=as.Date('2017-05-01','2017-05-01','2017-05-01','2016-05-01'))


   # A tibble: 4 x 4
   name    id   id2      dateT
         
1  Mike     1     a 2017-05-01
2  Carl     2   2017-05-01
3   Joe    NA     c 2017-05-01
4  Mike     1     a 2016-05-01

d2 <- data_frame(value=c(T, T, F, T,F), 
                 id=c(1,2,3,1,4), 
                 id2=c('a','b','c','a','d'),
                 dateT=as.Date('2017-05-01','2017-05-01','2017-05-01', '2016-05-01','2016-05-01'))
# A tibble: 5 x 4
  value    id   id2      dateT
         
1  TRUE     1     a 2017-05-01
2  TRUE     2     b 2017-05-01
3 FALSE     3     c 2017-05-01
4  TRUE     1     a 2016-05-01
5 FALSE     4     d 2016-05-01

And I am trying to join d2 with d1 based the pairs dateT and id or date and id2 depending on which one is available. Eventually, I would like to have all the records from d2 (wether they matched or not), and attached all the columns from d1 which have a match on d2.

What I have so far is

d2 %>% 
  inner_join(d1, by=c('id'='id', 'dateT'='dateT'))



 # A tibble: 5 x 6
  value    id id2.x      dateT  name id2.y
           
1  TRUE     1     a 2017-05-01  Mike     a
2  TRUE     2     b 2017-05-01  Carl  
3 FALSE     3     c 2017-05-01    
4  TRUE     1     a 2016-05-01  Mike     a
5 FALSE     4     d 2016-05-01

There are two issues:

I would rather attach only name from d1, since in my case id2.y is always gonna be a subset of id2.x (I want to keep only id2.x because it is always gonna be most complete). I know I can drop the columns later on, but I was wondering if dplyr has any functionality that would keep only the columns from the first dataframe in presence of duplicated column names
I don't know how to conditionally look in id2 when is.na(id)==T

The expected output would look like:

# A tibble: 5 x 6
  value    id   id2      dateT  name 
           
1  TRUE     1     a 2017-05-01  Mike 
2  TRUE     2     b 2017-05-01  Carl 
3 FALSE     3     c 2017-05-01   Joe 
4  TRUE     1     a 2016-05-01  Mike 
5 FALSE     4     d 2016-05-01

juod · Accepted Answer

How about this?

# first join by id, dropping id2
join1 = select(d1, -id2) %>%
    inner_join(d2, ., by=c("dateT", "id"))
# now take what couldn't be joined with id,
# drop id, join by id2, + left join to keep the remainder
join2 = d2 %>%
    anti_join(d1, by=c("dateT", "id")) %>%
    left_join(select(d1, -id), by=c("dateT", "id2"))

bind_rows(join1, join2)

# A tibble: 5 x 5
  value    id   id2      dateT  name
          
1  TRUE     1     a 2017-05-01  Mike
2  TRUE     2     b 2017-05-01  Carl
3  TRUE     1     a 2016-05-01  Mike
4 FALSE     4     d 2016-05-01  
5 FALSE     3     c 2017-05-01   Joe

Note that this code attempts to use id2 whenever id join fails, not only when id is NA.
Frankly, my approach requires three joins, so on a huge dataset it could be faster to recode the id variable - but that's difficult to do without knowing what it means in your data.

How to join dataframes conditionally on column pairs?

Answers (2)

Related Questions