Misha
Misha

Reputation: 3126

working with list columns in r (data.table/dplyr)

I have a large dataset organized in listcolumns (20Gb).

1-Is there any other tricks I can use to speed up the following code? Data.table seems to provide a two-fold increase, but I assume there are other tricks that can be used in similar scenarios.

2-Apart from saveRDS - is there any other faster file-libraries (vroom, fst and fwrite do not seem to support listcols) supporting listcolumns?

3-I tried dt[,.(test=tib_sort[tib_sort[, .I[.N]], stringi::stri_sub(dt$ch, length = 5)],by=id)]but it throws an incorrect number of dimensions error. Is there an option to do by using a listcolumn and automatically have it setDT and key/index?

library(dplyr)
library(purrr)
library(data.table)
library(tictoc)

Toy data

set.seed(123)
tib <-
  tibble(id = 1:20000) %>% mutate(k = map_int(id,  ~ sample(c(10:30), 1)))
tib <-
  tib %>% mutate(tib_df = map(k,  ~ tibble(
    ch = replicate(.x, paste0(
      sample(letters[1:24],
             size = sample(c(10:20), 1)),
      collapse = ""
    )),
    num = sample(1:1e10, .x,replace = F)
  )))

Dplyr

help <- function(df) {
  df <- df %>% top_n(1, num) %>% select(ch)
  stringi::stri_sub(df, length = 5)
}
tic("purrr")
tib <- tib %>% mutate(result = map_chr(tib_df, help))
toc(log = T, quiet = T)

Data.table

dt <- copy(tib)
setDT(dt)
tic("setDT w key")
dt[, tib_df := lapply(tib_df, setDT)]
dt[, tib_sort := lapply(tib_df, function(x)
  setindex(x, "num"))]
toc(log = T, quiet = T)
tic("dt w key")
dt[, result_dt_key := sapply(tib_df, function(x) {
  x[x[, .I[.N]], stringi::stri_sub(ch, length = 5)]
})]
toc(log=T, quiet = T)

Timing

    tic.log(format = T)
[[1]]
[1] "purrr: 25.499 sec elapsed"

[[2]]
[1] "setDT w key: 4.875 sec elapsed"

[[3]]
[1] "dt w key: 12.077 sec elapsed"

EDITS and update after also including unnested versions from both dplyr and data.table

1 purrr: 25.824 sec elapsed          
2 setDT wo key: 2.97 sec elapsed     
3 dt wo key: 13.724 sec elapsed      
4 setDT w key: 1.778 sec elapsed     
5 dt w key: 11.489 sec elapsed       
6 dplyr,unnest: 1.496 sec elapsed    
7 dt,I,unnest: 0.329 sec elapsed     
8 dt, join, unnest: 0.325 sec elapsed

tic("dt, join, unnest")
b <- unnest(tib)
setDT(b)
unnest.J <- b[b[, .(num=max(num)), by = 'id'], on=c('id','num')][,r2:=stringi::stri_sub(ch,length=5)][]
toc(log=T)

 res <- list(unnest.J$r2,tib2$result2,dt$result_dt_key,dt$result_dt,tib$result)
 sapply(res,identical,unnest.I$r2)
[1] TRUE TRUE TRUE TRUE TRUE

So- I guess the lesson learnt is that although listcolumns look deceivingly tempting as data structures for analysis, they are much slooower.

enter image description here

Upvotes: 3

Views: 367

Answers (1)

JBGruber
JBGruber

Reputation: 12410

Operating on list columns tends to be slow, as the functions have to iterate over the entries and can't be vectorized. Usually it thus makes sense to unnest list columns:

tic("unnest")
tib2 <- tib %>% 
  tidyr::unnest(tib_df) %>%
  group_by(id) %>% 
  top_n(1, num) %>% 
  mutate(result = stringi::stri_sub(ch, length = 5))
toc(log = T, quiet = T)

Result:

> tic.log(format = T)
[[1]]
[1] "purrr: 39.54 sec elapsed"

[[2]]
[1] "setDT w key: 10.7 sec elapsed"

[[3]]
[1] "dt w key: 19.19 sec elapsed"

[[4]]
[1] "unnest: 1.62 sec elapsed"

With your toy data, the final object is only slightly different. But if it is necessary to get the original form back, you might want to do something like this:

tic("unnest+reattach")
tib2 <- tib %>% 
  tidyr::unnest(tib_df) %>%
  group_by(id) %>% 
  top_n(1, num) %>% 
  mutate(result = stringi::stri_sub(ch, length = 5))

tib3 <- tib %>% 
  mutate(result = tib2$result[match(id, tib2$id)])

toc(log = T, quiet = T)

tic.log(format = T)[[5]]
[1] "unnest+reattach: 1.83 sec elapsed"

Upvotes: 4

Related Questions