datamole

Reputation: 155

Translate a vector of values using a key value mapping in R (equivalent to a HashMap)

I need to translate the values in a vector according to a mapping of key-value pairs:

vector <- c("dog","ant","eagle","ant","eagle","parrot") 

  "dog"  "ant"  "eagle"  "ant"  "eagle"  "parrot"


mapping <- data.frame(key=c("dog","cat","elephant","ant","parrot","eagle"),value=c("mammal","mammal","mammal","insect","bird","bird"))

  key      value
  dog      mammal
  cat      mammal
  elephant mammal
  ant      insect
  parrot   bird
  eagle    bird

The desired output would be like this:

output <- c("mammal", "insect", "bird", "insect", "bird", "bird")

In the real dataset I have to translate ~10,000 input vectors with an average length of ~15, and the mapping data frame has on the order of a million keys, with about 100,000 unique classes among the values.

The problem itself appears rather basic to me, but the bottleneck is runtime. In other programming languages you would probably use a HashMap for the mapping and then loop through the vector. Any solution in R I could come up with so far is orders of magnitude slower than a simple HashMap-based one in Java or Python (see comments below).

Is there a more efficient data structure to store the mapping than a data frame?

What would be the most runtime-efficient solution to this problem in R?

Upvotes: 7

Views: 2791

Answers (4)

D-prat

Reputation: 11

You could use a named vector to store the key-value pairs (keys as the vector names, values as the vector elements) and then look up the desired values by name.

animals = c('dog', 'ant', 'eagle', 'ant', 'eagle', 'parrot')
key_value = c('dog' = 'mammal',
              'cat' = 'mammal',
              'elephant' = 'mammal',
              'ant' = 'insect',
              'parrot' = 'bird',
              'eagle' = 'bird')
key_value[animals]

The result is a named vector, with the original keys as the names:

     dog      ant    eagle      ant    eagle   parrot 
"mammal" "insect"   "bird" "insect"   "bird"   "bird" 

A previous poster gave the following list-based solution, which the code above simplifies:

animals = c('dog', 'ant', 'eagle', 'ant', 'eagle', 'parrot')

key_value = list('dog' = 'mammal',
                 'cat' = 'mammal',
                 'elephant' = 'mammal',
                 'ant' = 'insect',
                 'parrot' = 'bird',
                 'eagle' = 'bird')

unlist(lapply(animals, FUN = function(x){key_value[[x]]}))

> unlist(lapply(animals, FUN = function(x){key_value[[x]]}))
[1] "mammal" "insect" "bird"   "insect" "bird"   "bird"  
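
If the mapping already lives in a data frame, as in the question, the named vector can be built from its columns rather than typed out. The following is a minimal sketch, assuming the `vector` and `mapping` objects from the question; `lookup` is a name chosen here for illustration.

```r
# Data from the question (stringsAsFactors = FALSE keeps the columns as character)
vector <- c("dog", "ant", "eagle", "ant", "eagle", "parrot")
mapping <- data.frame(
  key   = c("dog", "cat", "elephant", "ant", "parrot", "eagle"),
  value = c("mammal", "mammal", "mammal", "insect", "bird", "bird"),
  stringsAsFactors = FALSE
)

# Build the named vector once: names are the keys, elements are the values
lookup <- setNames(mapping$value, mapping$key)

# Vectorised lookup by name; unname() removes the key names from the result
output <- unname(lookup[vector])
output
```

Keys that are missing from the mapping come back as NA rather than raising an error.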

Upvotes: 1

acylam

Reputation: 18691

There is a package called hashmap which is perfect for this:

library(hashmap)

hash_lookup = hashmap(mapping$key, mapping$value)

output = hash_lookup[[vector]]

Result:

> hash_lookup
## (character) => (character)
##       [cat] => [mammal]   
##  [elephant] => [mammal]   
##       [ant] => [insect]   
##       [dog] => [mammal]   
##     [eagle] => [bird]     
##    [parrot] => [bird]     

> output
[1] "mammal" "insect" "bird"   "insect" "bird"   "bird"

Data:

vector <- c("dog","ant","eagle","ant","eagle","parrot")

mapping <- data.frame(key=c("dog","cat","elephant","ant","parrot","eagle"),
                      value=c("mammal","mammal","mammal","insect","bird","bird"),
                      stringsAsFactors = FALSE)

Note:

I have not tested this on a larger dataset, but the method should be very fast because the package is implemented with Rcpp internally.
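
If installing an extra package is not an option, a base R environment can serve as a hash table. This is a different technique from the hashmap package, sketched under the assumption that the keys are unique character strings; `env` is a name chosen here for illustration.

```r
vector <- c("dog", "ant", "eagle", "ant", "eagle", "parrot")
mapping <- data.frame(
  key   = c("dog", "cat", "elephant", "ant", "parrot", "eagle"),
  value = c("mammal", "mammal", "mammal", "insect", "bird", "bird"),
  stringsAsFactors = FALSE
)

# Store each key as a variable in a hashed environment
env <- list2env(as.list(setNames(mapping$value, mapping$key)),
                hash = TRUE, parent = emptyenv())

# mget() looks up all keys at once and returns a list; flatten it to a vector
output <- unlist(mget(vector, envir = env), use.names = FALSE)
output
```

Note that mget() raises an error for a key that is not found unless its ifnotfound argument is supplied.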

Upvotes: 5

kiran k madan

Reputation: 1

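One option would be to convert the vector to a factor and replace its levels, so that each distinct value is looked up only once.

library(data.table)

mapping = data.table(mapping)

# Key the table so that a character vector in i is joined against "key"
setkey(mapping, key)

vector = factor(vector)

# Translate the levels; levels that map to the same value are merged
levels(vector) = mapping[levels(vector), value]

The result is a factor; wrap it in as.character() if a character vector is needed.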
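
The same idea can be sketched in base R with match(): each distinct value is translated once, however often it occurs in the vector. This assumes the `vector` and `mapping` objects from the question; `f` is a name chosen here for illustration.

```r
vector <- c("dog", "ant", "eagle", "ant", "eagle", "parrot")
mapping <- data.frame(
  key   = c("dog", "cat", "elephant", "ant", "parrot", "eagle"),
  value = c("mammal", "mammal", "mammal", "insect", "bird", "bird"),
  stringsAsFactors = FALSE
)

# One level per distinct input value
f <- factor(vector)

# Look up each level's position in the key column and swap in its value;
# levels that receive the same value are merged
levels(f) <- mapping$value[match(levels(f), mapping$key)]

# Convert back to a plain character vector
output <- as.character(f)
output
```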

Upvotes: 0

CJB

Reputation: 1819

What about storing the mapping in a list? Start with:

FamLst <- list(mammal = c("elephant", "dog"), bird = c("parrot", "eagle"))

You can then add to the list in bits. You can bring up a list of all the mammals with FamLst$mammal, for example. And if you want to test if "dog" is a member of the mammals, use "dog" %in% FamLst$mammal.
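
For the translation task in the question, a list grouped by class has to be inverted into a key-to-value lookup first. A minimal sketch, assuming the FamLst list above; `lookup` is a name chosen here for illustration.

```r
FamLst <- list(mammal = c("elephant", "dog"), bird = c("parrot", "eagle"))

# Repeat each class name once per member, and name the result by member
lookup <- setNames(rep(names(FamLst), lengths(FamLst)),
                   unlist(FamLst, use.names = FALSE))

# Membership test from the answer
is_mammal <- "dog" %in% FamLst$mammal

# Vectorised translation from animal to class
translated <- unname(lookup[c("dog", "eagle", "parrot")])
translated
```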

Upvotes: 0
