Reputation: 1710
I have two dataframes that I am trying to merge :
set.seed(123)
df1 <- data.frame(ID=sample(letters[1:6],10,replace=TRUE))
df2 <- data.frame(
ID1 = letters[1:2],
ID2 = letters[3:4],
ID3 = letters[5:6],
V1 = c(23.32,21.24),
V2 = c(45.32,47.21)
)
Post merging, I want my df1
to contain the columns V1
and V2
along with ID
. I have tried using merge
, left_join
and inner_join
(from dplyr
) but can't figure out how to use the by
argument. The ID
column from df1
could exist in any of the three columns (ID1
, ID2
and ID3
) of df2
. How can I achieve this?
Upvotes: 0
Views: 419
Reputation: 834
The by
argument is used to specify the ID columns you want to join on assuming they are named differently for the left & right tables (if its the same name then it will automatically choose).
However, I have a way to simplify what you want to do. First, why not reshape df2
to only have 1 single ID column (assumption that the ID is unique within all 3 columns). You can do this by creating 3 separate dfs and then union together using bind_rows
.
Now that it is reshaped, you can do a right join. df1
is on the right side of the join and all the records in df1
will remain whether or not there is a match with df2
(otherwise V1
and V2
will be NULL i.e. NA
).
With the sample data provided from df1
, the results would be unexpected because each ID is repeated and not unique (so I have redefined df1
to have unique IDs only). If the IDs are not unique, you can group the results by ID and do an aggregation prior to doing the join.
set.seed(123)
#df1 <- data.frame(ID=sample(letters[1:6],10,replace=TRUE)) #This one has repeated IDs
df1 <- data.frame(ID=letters[1:6])
df2 <- data.frame(
ID1 = letters[1:2],
ID2 = letters[3:4],
ID3 = letters[5:6],
V1 = c(23.32,21.24),
V2 = c(45.32,47.21)
)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.4.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- bind_rows(df2 %>% select(ID=ID1, V1, V2),
df2 %>% select(ID=ID2, V1, V2),
df2 %>% select(ID=ID3, V1, V2)) %>%
right_join(df1)
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Joining, by = "ID"
#> Warning: Column `ID` joining character vector and factor, coercing into
#> character vector
df
#> ID V1 V2
#> 1 a 23.32 45.32
#> 2 b 21.24 47.21
#> 3 c 23.32 45.32
#> 4 d 21.24 47.21
#> 5 e 23.32 45.32
#> 6 f 21.24 47.21
Upvotes: 0
Reputation: 4187
You have to reshape in long format first, then join:
library(dplyr)
library(tidyr)
df2 %>%
gather(IDnr, ID, 1:3) %>%
left_join(df1, ., by = 'ID')
# alternative:
df1 %>%
left_join(., df2 %>% gather(IDnr, ID, 1:3), by = 'ID')
The result:
ID V1 V2 IDnr
1 d 21.24 47.21 ID2
2 e 23.32 45.32 ID3
3 f 21.24 47.21 ID3
4 d 21.24 47.21 ID2
5 f 21.24 47.21 ID3
6 c 23.32 45.32 ID2
7 a 23.32 45.32 ID1
8 e 23.32 45.32 ID3
9 a 23.32 45.32 ID1
10 d 21.24 47.21 ID2
Upvotes: 1