Reputation: 155
I need to translate the values in a vector according to a mapping of key-value pairs:
vector <- c("dog","ant","eagle","ant","eagle","parrot")
"dog" "ant" "eagle" "ant" "eagle" "parrot"
mapping <- data.frame(key=c("dog","cat","elephant","ant","parrot","eagle"),value=c("mammal","mammal","mammal","insect","bird","bird"))
key value
dog mammal
cat mammal
elephant mammal
ant insect
parrot bird
eagle bird
The desired output would be like this:
output <- c("mammal", "insect", "bird", "insect", "bird", "bird")
In the real dataset I have to translate ~10,000 input vectors with an average length of ~15, and the mapping data frame has about a million keys with about 100,000 unique classes among the values.
The problem itself appears rather basic to me, but the bottleneck is runtime. In other programming languages you would probably use a HashMap for the mapping and then loop through the vector. Any solution in R I could come up with so far is orders of magnitude slower than a simple HashMap-based one in Java or Python (see comments below).
Is there a more efficient data structure to store the mapping than a data frame?
What would be the most runtime-efficient solution to this problem in R?
Upvotes: 7
Views: 2791
Reputation: 11
You could store the key-value pairs in a named vector (the names are the keys, the elements are the values) and then look up the desired values by name.
animals = c('dog', 'ant', 'eagle', 'ant', 'eagle', 'parrot')
key_value = c('dog' = 'mammal',
'cat' = 'mammal',
'elephant' = 'mammal',
'ant' = 'insect',
'parrot' = 'bird',
'eagle' = 'bird')
key_value[animals]
The result is a named vector, with the original keys as the names:
dog ant eagle ant eagle parrot
"mammal" "insect" "bird" "insect" "bird" "bird"
The code above is a simplification of the list-based solution that the previous poster gave:
animals = c('dog', 'ant', 'eagle', 'ant', 'eagle', 'parrot')
key_value = list('dog' = 'mammal',
'cat' = 'mammal',
'elephant' = 'mammal',
'ant' = 'insect',
'parrot' = 'bird',
'eagle' = 'bird')
unlist(lapply(animals, FUN = function(x){key_value[[x]]}))
> unlist(lapply(animals, FUN = function(x){key_value[[x]]}))
[1] "mammal" "insect" "bird" "insect" "bird" "bird"
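If the mapping already lives in a data frame, as in the question, the named vector does not have to be typed out by hand. The following is only a sketch: it assumes the keys are unique and stored as character strings, and the name `result` is illustrative.

```r
# The question's mapping, with strings kept as character rather than factor.
mapping <- data.frame(
  key   = c("dog", "cat", "elephant", "ant", "parrot", "eagle"),
  value = c("mammal", "mammal", "mammal", "insect", "bird", "bird"),
  stringsAsFactors = FALSE
)
animals <- c("dog", "ant", "eagle", "ant", "eagle", "parrot")

# Build the named vector: names are the keys, elements are the values.
key_value <- setNames(mapping$value, mapping$key)

# Look up by name, then drop the names to get a plain character vector.
result <- unname(key_value[animals])
```

A key that is missing from the mapping yields NA rather than an error.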
Upvotes: 1
Reputation: 18691
There is a package called hashmap which is well suited to this:
library(hashmap)
hash_lookup = hashmap(mapping$key, mapping$value)
output = hash_lookup[[vector]]
Result:
> hash_lookup
## (character) => (character)
## [cat] => [mammal]
## [elephant] => [mammal]
## [ant] => [insect]
## [dog] => [mammal]
## [eagle] => [bird]
## [parrot] => [bird]
> output
[1] "mammal" "insect" "bird" "insect" "bird" "bird"
Data:
vector <- c("dog","ant","eagle","ant","eagle","parrot")
mapping <- data.frame(key=c("dog","cat","elephant","ant","parrot","eagle"),
value=c("mammal","mammal","mammal","insect","bird","bird"),
stringsAsFactors = FALSE)
Note:
I have not tested this on a bigger dataset yet, but this method should be very fast, since hashmap is implemented with Rcpp internally.
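For comparison, here is a base R sketch that needs no extra package. It is not a benchmark, and it assumes the keys are unique character strings. match() hashes the keys internally, and an environment created with hash = TRUE can act as a reusable hash map. The names `out_match`, `env` and `out_env` are illustrative.

```r
animals <- c("dog", "ant", "eagle", "ant", "eagle", "parrot")
mapping <- data.frame(
  key   = c("dog", "cat", "elephant", "ant", "parrot", "eagle"),
  value = c("mammal", "mammal", "mammal", "insect", "bird", "bird"),
  stringsAsFactors = FALSE
)

# Vectorised lookup: match() gives the row of each key (NA if absent),
# which is then used to index the value column.
out_match <- mapping$value[match(animals, mapping$key)]

# Environment as a hash map: build it once, reuse it for every input vector.
env <- list2env(as.list(setNames(mapping$value, mapping$key)), hash = TRUE)

# mget() fetches several keys at once; it raises an error for a missing key
# unless the ifnotfound argument is supplied.
out_env <- unlist(mget(animals, envir = env), use.names = FALSE)
```

With many short input vectors, one option is to concatenate them, call match() once, and split the result back by vector.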
Upvotes: 5
Reputation: 1
One option would be to factor the vector and change the levels.
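library(data.table)
mapping = data.table(mapping)
setkey(mapping, key)      # key the table so that character subsetting joins on `key`
vector = factor(vector)   # only the unique values (the levels) need translating
levels(vector) = mapping[levels(vector), value]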
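The same idea can be sketched in base R, in case data.table is not available. This assumes the keys are unique character strings. The names `f` and `translated` are illustrative.

```r
animals <- c("dog", "ant", "eagle", "ant", "eagle", "parrot")
mapping <- data.frame(
  key   = c("dog", "cat", "elephant", "ant", "parrot", "eagle"),
  value = c("mammal", "mammal", "mammal", "insect", "bird", "bird"),
  stringsAsFactors = FALSE
)

# Turn the input into a factor so each distinct key is translated only once.
f <- factor(animals)

# Replace each level with its mapped value; levels that map to the same
# value (here "eagle" and "parrot") are merged into one level.
levels(f) <- mapping$value[match(levels(f), mapping$key)]

# Convert back to a plain character vector.
translated <- as.character(f)
```

The result stays a factor until the last step, so as.character() is needed if a character vector is expected downstream.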
Upvotes: 0
Reputation: 1819
What about in a list? Start with:
FamLst <- list(mammal = c("elephant", "dog"), bird = c("parrot", "eagle"))
You can then add to the list in bits. You can bring up a list of all the mammals with FamLst$mammal, for example. And if you want to test whether "dog" is a member of the mammals, use "dog" %in% FamLst$mammal.
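A list grouped by class answers "which animals are mammals?", but the question asks for the reverse direction, from animal to class. One way to get there, sketched here on the assumption that each animal appears in only one group, is to flatten the list into a named vector. The names `lookup` and `classes` are illustrative.

```r
FamLst <- list(mammal = c("elephant", "dog"), bird = c("parrot", "eagle"))

# Repeat each class name once per member, then name the result by member.
lookup <- setNames(
  rep(names(FamLst), times = lengths(FamLst)),
  unlist(FamLst, use.names = FALSE)
)

# Translate a vector of animals to their classes.
classes <- unname(lookup[c("dog", "eagle", "parrot")])
```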
Upvotes: 0