Dambo
Dambo

Reputation: 3486

How to join dataframes conditionally on column pairs?

I have two df

d1 <- data_frame(name=c('Mike','Carl','Joe', 'Mike'), 
           id=c(1,2,NA,1), 
           id2=c('a',NA,'c','a'),
           dateT=as.Date('2017-05-01','2017-05-01','2017-05-01','2016-05-01'))


   # A tibble: 4 x 4
   name    id   id2      dateT
  <chr> <dbl> <chr>     <date>
1  Mike     1     a 2017-05-01
2  Carl     2  <NA> 2017-05-01
3   Joe    NA     c 2017-05-01
4  Mike     1     a 2016-05-01

d2 <- data_frame(value=c(T, T, F, T,F), 
                 id=c(1,2,3,1,4), 
                 id2=c('a','b','c','a','d'),
                 dateT=as.Date('2017-05-01','2017-05-01','2017-05-01', '2016-05-01','2016-05-01'))
# A tibble: 5 x 4
  value    id   id2      dateT
  <lgl> <dbl> <chr>     <date>
1  TRUE     1     a 2017-05-01
2  TRUE     2     b 2017-05-01
3 FALSE     3     c 2017-05-01
4  TRUE     1     a 2016-05-01
5 FALSE     4     d 2016-05-01

And I am trying to join d2 with d1 based the pairs dateT and id or date and id2 depending on which one is available. Eventually, I would like to have all the records from d2 (wether they matched or not), and attached all the columns from d1 which have a match on d2.

What I have so far is

d2 %>% 
  inner_join(d1, by=c('id'='id', 'dateT'='dateT'))



 # A tibble: 5 x 6
  value    id id2.x      dateT  name id2.y
  <lgl> <dbl> <chr>     <date> <chr> <chr>
1  TRUE     1     a 2017-05-01  Mike     a
2  TRUE     2     b 2017-05-01  Carl  <NA>
3 FALSE     3     c 2017-05-01  <NA>  <NA>
4  TRUE     1     a 2016-05-01  Mike     a
5 FALSE     4     d 2016-05-01  <NA>  <NA>

There are two issues:

  1. I would rather attach only name from d1, since in my case id2.y is always gonna be a subset of id2.x (I want to keep only id2.x because it is always gonna be most complete). I know I can drop the columns later on, but I was wondering if dplyr has any functionality that would keep only the columns from the first dataframe in presence of duplicated column names

  2. I don't know how to conditionally look in id2 when is.na(id)==T

The expected output would look like:

# A tibble: 5 x 6
  value    id   id2      dateT  name 
  <lgl> <dbl> <chr>     <date> <chr> 
1  TRUE     1     a 2017-05-01  Mike 
2  TRUE     2     b 2017-05-01  Carl 
3 FALSE     3     c 2017-05-01   Joe 
4  TRUE     1     a 2016-05-01  Mike 
5 FALSE     4     d 2016-05-01  <NA> 

Upvotes: 1

Views: 82

Answers (2)

juod
juod

Reputation: 450

How about this?

# first join by id, dropping id2
join1 = select(d1, -id2) %>%
    inner_join(d2, ., by=c("dateT", "id"))
# now take what couldn't be joined with id,
# drop id, join by id2, + left join to keep the remainder
join2 = d2 %>%
    anti_join(d1, by=c("dateT", "id")) %>%
    left_join(select(d1, -id), by=c("dateT", "id2"))

bind_rows(join1, join2)

# A tibble: 5 x 5
  value    id   id2      dateT  name
  <lgl> <dbl> <chr>     <date> <chr>
1  TRUE     1     a 2017-05-01  Mike
2  TRUE     2     b 2017-05-01  Carl
3  TRUE     1     a 2016-05-01  Mike
4 FALSE     4     d 2016-05-01  <NA>
5 FALSE     3     c 2017-05-01   Joe

Note that this code attempts to use id2 whenever id join fails, not only when id is NA.
Frankly, my approach requires three joins, so on a huge dataset it could be faster to recode the id variable - but that's difficult to do without knowing what it means in your data.

Upvotes: 1

r.user.05apr
r.user.05apr

Reputation: 5456

Try:

library(dplyr)

d1 <- data_frame(name=c('Mike','Carl','Joe', 'Mike'), 
                 id=c(1,2,NA,1), 
                 id2=c('a',NA,'c','a'),
                 dateT=as.Date(c('2017-05-01','2017-05-01','2017-05-01','2016-05-01')))
# add combine-"c" !!

d2 <- data_frame(value=c(T, T, F, T,F), 
                 id=c(1,2,3,1,4), 
                 id2=c('a','b','c','a','d'),
                 dateT=as.Date(c('2017-05-01','2017-05-01','2017-05-01', '2016-05-01','2016-05-01')))
# add combine-"c" !!

left_join(d2,
          d1 %>%
            select(name, id, dateT) %>%
            rename(name1=name),
          by=c("id", "dateT")) %>%
  left_join(d1 %>%
              select(name, id2, dateT) %>%
              rename(name2=name),
            by=c("id2", "dateT")) %>%
  transmute(value, id, id2, dateT, 
            name=ifelse(is.na(name1),name2,name1))

Upvotes: 1

Related Questions