user2904120
user2904120

Reputation: 416

Replace key column with values using a hash table/dictionary in R

I would like to merge two dataframes based on mutliple columns. Here based on column B in df1 and all columns from X-Z in df2, but returning values from column X, into V1. Like a dictionary, if a from df1$B matches with a in df2$X, a is returned to df_merged$V1, but then if c from df1$B matches c with df2$Y, b is returned from df2$X, which is its synonym and etc. Only df2$X can be returned to df_merged$V1

df1

A   B
1   a
2   c
3   f

and df2

X   Y   Z
a   NA  NA
b   c   NA
d   e   f

merged_df

A   V1
1   a
2   b
3   d

Here is my try:

merge(df1, df2, by.x="B", by.y=c("X", "Y", "Z"), all.x=T)

Upvotes: 1

Views: 2249

Answers (2)

acylam
acylam

Reputation: 18661

You can do this generically with tidyverse, or you can actually use a hash/dictionary-like data structure. In R, there is no native hash table class, but you can take advantage of the hashmap package, which uses Rcpp internally to create hash-like objects:

library(tidyverse)
library(hashmap)

dict = df2 %>%
  mutate_if(is.factor, as.character) %>%
  mutate(Value = X) %>%
  gather(Label, Key, -Value) %>%
  na.omit() %>%
  {hashmap(.$Key, .$Value)}

This gives you a hash table:

> dict
## (character) => (character)
##         [e] => [d]        
##         [d] => [d]        
##         [f] => [d]        
##         [b] => [b]        
##         [a] => [a] 

Now, to extract value using df1$B as a key, simply do this:

dict[[df1$B]]
# [1] "a" NA  "a" "d"

df1 %>%
  mutate(Value = dict[[B]]) %>%
  na.omit() %>%
  select(-B)

Result:

  A Value
1 1     a
3 3     a
4 4     d

Data:

df1 = read.table(text = "A   B
                 1   a
                 2   c
                 3   a
                 4   e", header = TRUE, stringsAsFactors = TRUE)

df2 = read.table(text = "X   Y   Z
                 a   NA  NA
                 b   NA  NA
                 d   e   f", header = TRUE, stringsAsFactors = TRUE)

Upvotes: 2

acylam
acylam

Reputation: 18661

Here is the generic version for demonstrative purposes, but this method is less robust, less flexible, and less efficient than using hashmap:

library(tidyverse)

df2 %>%
  mutate(Value = X) %>%
  gather(Label, Key, -Value) %>%
  split(.$Label) %>%
  map_dfr(~ cbind(A = na.omit(match(.$Key, df1$B)), 
              slice(., match(df1$B, .$Key)))) %>%
  select(A, Value) %>%
  arrange(A)

Result:

  A Value
1 1     a
2 3     d
3 4     d

Data:

df1 = read.table(text = "A   B
                 1   a
                 2   c
                 3   f
                 4   e", header = TRUE, stringsAsFactors = FALSE)

df2 = read.table(text = "X   Y   Z
                 a   NA  NA
                 b   NA  NA
                 d   e   f", header = TRUE, stringsAsFactors = FALSE)

Note:

I first duplicated df2$X because it is technically also a key. I then reshaped df2 to long form, and split by Key. For every Key, I sliced the rows that match with df2$B and rbinding the results together with map_dfr. Finally, returning only the Value column.

Notice that I used a different df1 in this example than I did in the hashmap example because this method would not work if df1$B has duplicates.

Upvotes: 1

Related Questions