Reputation: 3126
I have a large dataset organized in listcolumns (20Gb).
1-Is there any other tricks I can use to speed up the following code? Data.table seems to provide a two-fold increase, but I assume there are other tricks that can be used in similar scenarios.
2-Apart from saveRDS - is there any other faster file-libraries (vroom, fst and fwrite do not seem to support listcols) supporting listcolumns?
3-I tried dt[,.(test=tib_sort[tib_sort[, .I[.N]], stringi::stri_sub(dt$ch, length = 5)],by=id)]
but it throws an incorrect number of dimensions error. Is there an option to do by
using a listcolumn and automatically have it setDT and key/index?
library(dplyr)
library(purrr)
library(data.table)
library(tictoc)
set.seed(123)
tib <-
tibble(id = 1:20000) %>% mutate(k = map_int(id, ~ sample(c(10:30), 1)))
tib <-
tib %>% mutate(tib_df = map(k, ~ tibble(
ch = replicate(.x, paste0(
sample(letters[1:24],
size = sample(c(10:20), 1)),
collapse = ""
)),
num = sample(1:1e10, .x,replace = F)
)))
help <- function(df) {
df <- df %>% top_n(1, num) %>% select(ch)
stringi::stri_sub(df, length = 5)
}
tic("purrr")
tib <- tib %>% mutate(result = map_chr(tib_df, help))
toc(log = T, quiet = T)
dt <- copy(tib)
setDT(dt)
tic("setDT w key")
dt[, tib_df := lapply(tib_df, setDT)]
dt[, tib_sort := lapply(tib_df, function(x)
setindex(x, "num"))]
toc(log = T, quiet = T)
tic("dt w key")
dt[, result_dt_key := sapply(tib_df, function(x) {
x[x[, .I[.N]], stringi::stri_sub(ch, length = 5)]
})]
toc(log=T, quiet = T)
tic.log(format = T)
[[1]]
[1] "purrr: 25.499 sec elapsed"
[[2]]
[1] "setDT w key: 4.875 sec elapsed"
[[3]]
[1] "dt w key: 12.077 sec elapsed"
1 purrr: 25.824 sec elapsed
2 setDT wo key: 2.97 sec elapsed
3 dt wo key: 13.724 sec elapsed
4 setDT w key: 1.778 sec elapsed
5 dt w key: 11.489 sec elapsed
6 dplyr,unnest: 1.496 sec elapsed
7 dt,I,unnest: 0.329 sec elapsed
8 dt, join, unnest: 0.325 sec elapsed
tic("dt, join, unnest")
b <- unnest(tib)
setDT(b)
unnest.J <- b[b[, .(num=max(num)), by = 'id'], on=c('id','num')][,r2:=stringi::stri_sub(ch,length=5)][]
toc(log=T)
res <- list(unnest.J$r2,tib2$result2,dt$result_dt_key,dt$result_dt,tib$result)
sapply(res,identical,unnest.I$r2)
[1] TRUE TRUE TRUE TRUE TRUE
So- I guess the lesson learnt is that although listcolumns look deceivingly tempting as data structures for analysis, they are much slooower.
Upvotes: 3
Views: 367
Reputation: 12410
Operating on list columns tends to be slow, as the functions have to iterate over the entries and can't be vectorized. Usually it thus makes sense to unnest
list columns:
tic("unnest")
tib2 <- tib %>%
tidyr::unnest(tib_df) %>%
group_by(id) %>%
top_n(1, num) %>%
mutate(result = stringi::stri_sub(ch, length = 5))
toc(log = T, quiet = T)
Result:
> tic.log(format = T)
[[1]]
[1] "purrr: 39.54 sec elapsed"
[[2]]
[1] "setDT w key: 10.7 sec elapsed"
[[3]]
[1] "dt w key: 19.19 sec elapsed"
[[4]]
[1] "unnest: 1.62 sec elapsed"
With your toy data, the final object is only slightly different. But if it is necessary to get the original form back, you might want to do something like this:
tic("unnest+reattach")
tib2 <- tib %>%
tidyr::unnest(tib_df) %>%
group_by(id) %>%
top_n(1, num) %>%
mutate(result = stringi::stri_sub(ch, length = 5))
tib3 <- tib %>%
mutate(result = tib2$result[match(id, tib2$id)])
toc(log = T, quiet = T)
tic.log(format = T)[[5]]
[1] "unnest+reattach: 1.83 sec elapsed"
Upvotes: 4