Reputation: 3486
I have two df
d1 <- data_frame(name=c('Mike','Carl','Joe', 'Mike'),
id=c(1,2,NA,1),
id2=c('a',NA,'c','a'),
dateT=as.Date('2017-05-01','2017-05-01','2017-05-01','2016-05-01'))
# A tibble: 4 x 4
name id id2 dateT
<chr> <dbl> <chr> <date>
1 Mike 1 a 2017-05-01
2 Carl 2 <NA> 2017-05-01
3 Joe NA c 2017-05-01
4 Mike 1 a 2016-05-01
d2 <- data_frame(value=c(T, T, F, T,F),
id=c(1,2,3,1,4),
id2=c('a','b','c','a','d'),
dateT=as.Date('2017-05-01','2017-05-01','2017-05-01', '2016-05-01','2016-05-01'))
# A tibble: 5 x 4
value id id2 dateT
<lgl> <dbl> <chr> <date>
1 TRUE 1 a 2017-05-01
2 TRUE 2 b 2017-05-01
3 FALSE 3 c 2017-05-01
4 TRUE 1 a 2016-05-01
5 FALSE 4 d 2016-05-01
And I am trying to join d2
with d1
based the pairs dateT
and id
or date
and id2
depending on which one is available.
Eventually, I would like to have all the records from d2 (wether they matched or not), and attached all the columns from d1 which have a match on d2.
What I have so far is
d2 %>%
inner_join(d1, by=c('id'='id', 'dateT'='dateT'))
# A tibble: 5 x 6
value id id2.x dateT name id2.y
<lgl> <dbl> <chr> <date> <chr> <chr>
1 TRUE 1 a 2017-05-01 Mike a
2 TRUE 2 b 2017-05-01 Carl <NA>
3 FALSE 3 c 2017-05-01 <NA> <NA>
4 TRUE 1 a 2016-05-01 Mike a
5 FALSE 4 d 2016-05-01 <NA> <NA>
There are two issues:
I would rather attach only name
from d1
, since in my case id2.y
is always gonna be a subset of id2.x
(I want to keep only id2.x
because it is always gonna be most complete). I know I can drop the columns later on, but I was wondering if dplyr has any functionality that would keep only the columns from the first dataframe in presence of duplicated column names
I don't know how to conditionally look in id2
when is.na(id)==T
The expected output would look like:
# A tibble: 5 x 6
value id id2 dateT name
<lgl> <dbl> <chr> <date> <chr>
1 TRUE 1 a 2017-05-01 Mike
2 TRUE 2 b 2017-05-01 Carl
3 FALSE 3 c 2017-05-01 Joe
4 TRUE 1 a 2016-05-01 Mike
5 FALSE 4 d 2016-05-01 <NA>
Upvotes: 1
Views: 82
Reputation: 450
How about this?
# first join by id, dropping id2
join1 = select(d1, -id2) %>%
inner_join(d2, ., by=c("dateT", "id"))
# now take what couldn't be joined with id,
# drop id, join by id2, + left join to keep the remainder
join2 = d2 %>%
anti_join(d1, by=c("dateT", "id")) %>%
left_join(select(d1, -id), by=c("dateT", "id2"))
bind_rows(join1, join2)
# A tibble: 5 x 5
value id id2 dateT name
<lgl> <dbl> <chr> <date> <chr>
1 TRUE 1 a 2017-05-01 Mike
2 TRUE 2 b 2017-05-01 Carl
3 TRUE 1 a 2016-05-01 Mike
4 FALSE 4 d 2016-05-01 <NA>
5 FALSE 3 c 2017-05-01 Joe
Note that this code attempts to use id2
whenever id
join fails, not only when id
is NA
.
Frankly, my approach requires three joins, so on a huge dataset it could be faster to recode the id
variable - but that's difficult to do without knowing what it means in your data.
Upvotes: 1
Reputation: 5456
Try:
library(dplyr)
d1 <- data_frame(name=c('Mike','Carl','Joe', 'Mike'),
id=c(1,2,NA,1),
id2=c('a',NA,'c','a'),
dateT=as.Date(c('2017-05-01','2017-05-01','2017-05-01','2016-05-01')))
# add combine-"c" !!
d2 <- data_frame(value=c(T, T, F, T,F),
id=c(1,2,3,1,4),
id2=c('a','b','c','a','d'),
dateT=as.Date(c('2017-05-01','2017-05-01','2017-05-01', '2016-05-01','2016-05-01')))
# add combine-"c" !!
left_join(d2,
d1 %>%
select(name, id, dateT) %>%
rename(name1=name),
by=c("id", "dateT")) %>%
left_join(d1 %>%
select(name, id2, dateT) %>%
rename(name2=name),
by=c("id2", "dateT")) %>%
transmute(value, id, id2, dateT,
name=ifelse(is.na(name1),name2,name1))
Upvotes: 1