Reputation: 416
I would like to merge two dataframes based on mutliple columns. Here based on column B in df1 and all columns from X-Z in df2, but returning values from column X, into V1. Like a dictionary, if a from df1$B matches with a in df2$X, a is returned to df_merged$V1, but then if c from df1$B matches c with df2$Y, b is returned from df2$X, which is its synonym and etc. Only df2$X can be returned to df_merged$V1
df1
A B
1 a
2 c
3 f
and df2
X Y Z
a NA NA
b c NA
d e f
merged_df
A V1
1 a
2 b
3 d
Here is my try:
merge(df1, df2, by.x="B", by.y=c("X", "Y", "Z"), all.x=T)
Upvotes: 1
Views: 2249
Reputation: 18661
You can do this generically with tidyverse
, or you can actually use a hash/dictionary-like data structure. In R, there is no native hash table class, but you can take advantage of the hashmap
package, which uses Rcpp
internally to create hash-like objects:
library(tidyverse)
library(hashmap)
dict = df2 %>%
mutate_if(is.factor, as.character) %>%
mutate(Value = X) %>%
gather(Label, Key, -Value) %>%
na.omit() %>%
{hashmap(.$Key, .$Value)}
This gives you a hash table:
> dict
## (character) => (character)
## [e] => [d]
## [d] => [d]
## [f] => [d]
## [b] => [b]
## [a] => [a]
Now, to extract value using df1$B
as a key, simply do this:
dict[[df1$B]]
# [1] "a" NA "a" "d"
df1 %>%
mutate(Value = dict[[B]]) %>%
na.omit() %>%
select(-B)
Result:
A Value
1 1 a
3 3 a
4 4 d
Data:
df1 = read.table(text = "A B
1 a
2 c
3 a
4 e", header = TRUE, stringsAsFactors = TRUE)
df2 = read.table(text = "X Y Z
a NA NA
b NA NA
d e f", header = TRUE, stringsAsFactors = TRUE)
Upvotes: 2
Reputation: 18661
Here is the generic version for demonstrative purposes, but this method is less robust, less flexible, and less efficient than using hashmap
:
library(tidyverse)
df2 %>%
mutate(Value = X) %>%
gather(Label, Key, -Value) %>%
split(.$Label) %>%
map_dfr(~ cbind(A = na.omit(match(.$Key, df1$B)),
slice(., match(df1$B, .$Key)))) %>%
select(A, Value) %>%
arrange(A)
Result:
A Value
1 1 a
2 3 d
3 4 d
Data:
df1 = read.table(text = "A B
1 a
2 c
3 f
4 e", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "X Y Z
a NA NA
b NA NA
d e f", header = TRUE, stringsAsFactors = FALSE)
Note:
I first duplicated df2$X
because it is technically also a key. I then reshaped df2
to long form, and split
by Key
. For every Key, I sliced
the rows that match with df2$B
and rbind
ing the results together with map_dfr
. Finally, returning only the Value
column.
Notice that I used a different df1
in this example than I did in the hashmap
example because this method would not work if df1$B
has duplicates.
Upvotes: 1