Dhiraj
Dhiraj

Reputation: 1710

Merging two dataframes on multiple columns

I have two dataframes that I am trying to merge :

set.seed(123)
df1 <- data.frame(ID=sample(letters[1:6],10,replace=TRUE))
df2 <- data.frame(
  ID1 = letters[1:2],
  ID2 = letters[3:4],
  ID3 = letters[5:6],
  V1 = c(23.32,21.24),
  V2 = c(45.32,47.21)
)

Post merging, I want my df1 to contain the columns V1 and V2 along with ID. I have tried using merge, left_join and inner_join (from dplyr) but can't figure out how to use the by argument. The ID column from df1 could exist in any of the three columns (ID1, ID2 and ID3) of df2. How can I achieve this?

Upvotes: 0

Views: 419

Answers (2)

Jas
Jas

Reputation: 834

The by argument is used to specify the ID columns you want to join on assuming they are named differently for the left & right tables (if its the same name then it will automatically choose).

However, I have a way to simplify what you want to do. First, why not reshape df2 to only have 1 single ID column (assumption that the ID is unique within all 3 columns). You can do this by creating 3 separate dfs and then union together using bind_rows.

Now that it is reshaped, you can do a right join. df1 is on the right side of the join and all the records in df1 will remain whether or not there is a match with df2 (otherwise V1 and V2 will be NULL i.e. NA).

With the sample data provided from df1, the results would be unexpected because each ID is repeated and not unique (so I have redefined df1 to have unique IDs only). If the IDs are not unique, you can group the results by ID and do an aggregation prior to doing the join.

set.seed(123)
#df1 <- data.frame(ID=sample(letters[1:6],10,replace=TRUE)) #This one has repeated IDs
df1 <- data.frame(ID=letters[1:6])
df2 <- data.frame(
  ID1 = letters[1:2],
  ID2 = letters[3:4],
  ID3 = letters[5:6],
  V1 = c(23.32,21.24),
  V2 = c(45.32,47.21)
)

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.4.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- bind_rows(df2 %>% select(ID=ID1, V1, V2),
                df2 %>% select(ID=ID2, V1, V2),
                df2 %>% select(ID=ID3, V1, V2)) %>% 
      right_join(df1)
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Joining, by = "ID"
#> Warning: Column `ID` joining character vector and factor, coercing into
#> character vector
df
#>   ID    V1    V2
#> 1  a 23.32 45.32
#> 2  b 21.24 47.21
#> 3  c 23.32 45.32
#> 4  d 21.24 47.21
#> 5  e 23.32 45.32
#> 6  f 21.24 47.21

Upvotes: 0

h3rm4n
h3rm4n

Reputation: 4187

You have to reshape in long format first, then join:

library(dplyr)
library(tidyr)

df2 %>% 
  gather(IDnr, ID, 1:3) %>% 
  left_join(df1, ., by = 'ID')

# alternative:
df1 %>% 
  left_join(., df2 %>% gather(IDnr, ID, 1:3), by = 'ID')

The result:

   ID    V1    V2 IDnr
1   d 21.24 47.21  ID2
2   e 23.32 45.32  ID3
3   f 21.24 47.21  ID3
4   d 21.24 47.21  ID2
5   f 21.24 47.21  ID3
6   c 23.32 45.32  ID2
7   a 23.32 45.32  ID1
8   e 23.32 45.32  ID3
9   a 23.32 45.32  ID1
10  d 21.24 47.21  ID2

Upvotes: 1

Related Questions