Reputation: 703
I have some data where each data point is associated with a character vector of varying length. For example, it might be generated by the following function:
library(tidyverse)
set.seed(27)
generate_keyset <- function(...) {
  sample(LETTERS[1:5], size = rpois(n = 1, lambda = 10), replace = TRUE)
}
generate_keyset()
#> [1] "A" "C" "A" "A" "A" "A" "A" "E" "C" "C" "A" "D" "A" "D" "C" "A"
I would like to summarize this keyset by converting it to a single numeric score. The way this works is straightforward: each key in the keyset has a value, and the value of the entire keyset is the sum of the values of its keys (repeated keys count each time they appear). The key-value map is a tibble with several hundred entries, but you can imagine it looks like:
key_value_map <- tribble(
  ~key, ~value,
  "A", 1,
  "B", -2,
  "C", 8,
  "D", -4,
  "E", 0
)
Currently I am scoring keysets with the following function:
score_keyset <- function(keyset) {
  merged_keysets_to_map <- data.frame(
    key = keyset,
    stringsAsFactors = FALSE
  ) %>%
    left_join(key_value_map, by = "key")
  sum(merged_keysets_to_map$value)
}
score_keyset(LETTERS[1:4])
#> [1] 3
This works fine, except it is very slow, and I need to do this operation about a million times. For example, I would like the following to be much faster:
n <- 1e4 # in practice I have n = 1e6
fake_data <- tibble(
  keyset = map(1:n, generate_keyset)
)
library(tictoc)
tic()
scored_data <- fake_data %>%
  mutate(
    value = map_dbl(keyset, score_keyset)
  )
toc()
I am sure there is a much better way to do this with indexing, but it is escaping me at the moment. Any help speeding this up is much appreciated.
Upvotes: 2
Views: 39
Reputation: 887301
Instead of doing a join and then a sum, it is more efficient to turn the map into a named vector and index it by the keyset:
library(tibble)
sum(deframe(key_value_map)[generate_keyset()])
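The reason this works is that indexing a named vector with a character vector returns one element per key, with repeats, so the join collapses into a plain lookup. A minimal base R sketch, assuming the five-row example map from the question (the names `lookup`, `keys`, and `score` are only illustrative):

```r
# Named numeric vector; equivalent to deframe(key_value_map) for the example map
lookup <- c(A = 1, B = -2, C = 8, D = -4, E = 0)

keys <- c("A", "B", "C", "D")

# Character indexing returns one value per key, in the order of `keys`
lookup[keys]

# Summing the looked-up values gives the keyset score
score <- sum(lookup[keys])
score
#> [1] 3
```

A key that is missing from the map is looked up as NA, so the sum is NA as well, which is the same result the join-based version gives.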
Checking the timings, the OP's tic/toc showed 45.728 sec. With the named vector:
tic()
v1 <- deframe(key_value_map)
scored_data2 <- fake_data %>%
  mutate(
    value = map_dbl(keyset, ~ sum(v1[.x]))
  )
toc()
#0.952 sec elapsed
identical(scored_data, scored_data2)
#[1] TRUE
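If the per-keyset call is still a bottleneck at n = 1e6, one possible further step is to flatten all keysets, do a single lookup, and sum by group with base R's `rowsum()`. This is a sketch only, not timed here and not part of the benchmark above; `lookup`, `keysets`, `group`, and `scores` are hypothetical names, and the small inputs stand in for `v1` and `fake_data$keyset`:

```r
# Stand-ins for v1 and fake_data$keyset (assumed example data)
lookup <- c(A = 1, B = -2, C = 8, D = -4, E = 0)
keysets <- list(c("A", "B"), c("C", "C", "D"), character(0))

# Group id for every key after flattening: keyset i contributes lengths(keysets)[i] ids
group <- rep(seq_along(keysets), lengths(keysets))

# One vectorized lookup over all keys at once
values <- unname(lookup[unlist(keysets)])

# Sum values within each group; empty keysets have no rows, so start from 0
sums <- rowsum(values, group)
scores <- numeric(length(keysets))
scores[as.integer(rownames(sums))] <- sums[, 1]
scores
#> [1] -1 12  0
```

Starting from a zero vector keeps empty keysets at 0, which matches `sum()` of an empty vector in the `map_dbl` version.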
Upvotes: 3