robertdj
robertdj

Reputation: 1117

Fast unnest with data.table

I am currently using the tidyr package to unnest list columns. However, I am looking for a faster approach and turned to data.table (where I am a noob). Consider the following example:

dt1 <- data.table::data.table(
    a = c("a1", "a2"),
    df1 = list(data.frame(
        b = c("b1", "b2")
    ))
)

tidyr::unnest(dt1, df1)
#> # A tibble: 4 x 2
#>   a     b    
#>   <chr> <chr>
#> 1 a1    b1   
#> 2 a1    b2   
#> 3 a2    b1   
#> 4 a2    b2

dt1[, data.table::rbindlist(df1), by = .(a)]
#>     a  b
#> 1: a1 b1
#> 2: a1 b2
#> 3: a2 b1
#> 4: a2 b2
Created on 2021-06-22 by the reprex package (v1.0.0)

I get the same result, but if I have a large data.table and more columns in by this approach gives worse performance with data.table than with tidyr. Can this be alleviated?

One follow-up question is how to unnest multiple columns with data.table. Consider this example:

dt2 <- data.table::data.table(
    a = c("a1", "a2"),
    df1 = list(data.frame(
        b = c("b1", "b2")
    )),
    df2 = list(data.frame(
        c = c("c1", "c2")
    ))
)

tidyr::unnest(dt2, c(df1, df2))
#> # A tibble: 4 x 3
#>   a     b     c    
#>   <chr> <chr> <chr>
#> 1 a1    b1    c1   
#> 2 a1    b2    c2   
#> 3 a2    b1    c1   
#> 4 a2    b2    c2
Created on 2021-06-22 by the reprex package (v1.0.0)

Using multiple arguments in data.table::rbindlist doesn't seem to work.

Update: After making a large(r) example to demonstrate my claim about execution time it turns out that tidyr is quite sensitive to whether the list column contains data.frames or data.tables:

n_inner <- 300
inner_df <- data.frame(
    d1 = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_inner),
    d2 = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_inner),
    d3 = rnorm(n_inner)
)

n_outer <- 400

dt <- data.table::data.table(
    a = sample(10, n_outer, replace = TRUE),
    b = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_outer),
    c = seq.POSIXt(as.POSIXct("2019-01-01"), as.POSIXct("2020-01-01"), length.out = n_outer),
    d = rep(list(inner_df), n_outer)
)

bench::mark(check = FALSE,
    tidyr = tidyr::unnest(dt, d),
    datatable = dt[, data.table::rbindlist(d), by = .(a, b, c)]
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyr          14ms   18.7ms      53.2      18MB     26.6
#> 2 datatable    56.2ms   56.2ms      17.8    25.5MB    178.

inner_dt <- data.table::as.data.table(inner_df)
dt$d <- rep(list(inner_dt), n_outer)

bench::mark(check = FALSE,
    tidyr = tidyr::unnest(dt, d),
    datatable = dt[, data.table::rbindlist(d), by = .(a, b, c)]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyr       202.2ms  209.3ms      4.40    28.4MB     19.1
#> 2 datatable    43.5ms   49.9ms     18.3     25.4MB     22.0

Created on 2021-06-22 by the reprex package (v1.0.0)

In my actual usecase I have nested data.frames since it comes from JSON parsed with RcppSimdJson and here tidyr is faster.

Upvotes: 5

Views: 1654

Answers (2)

GKi
GKi

Reputation: 39707

Maybe you can use a combination of base and data.table as it seems using data.table::rbindlist is faster than do.call with rbind. Have also a look on: How to speed up rbind?

For the given data in the update it would look like:

data.frame(dt[rep(seq_len(nrow(dt)), vapply(dt$d, nrow, 0L)),1:3],
  data.table::rbindlist(dt$d)

Benchmark based on the example given in the question:

f <- alist(tidyr = tidyr::unnest(dt, d)
 , datatable = dt[, data.table::rbindlist(d), by = .(a, b, c)]
 , base=do.call(rbind, lapply(seq_len(nrow(dt)), function(i) do.call(data.frame, dt[i,])))
 , base2=data.frame(dt[rep(seq_len(nrow(dt)), vapply(dt$d, nrow, 0L)),1:3], do.call(rbind, dt$d))
 , dtBase=data.frame(dt[rep(seq_len(nrow(dt)), vapply(dt$d, nrow, 0L)),1:3], data.table::rbindlist(dt$d)))

set.seed(42)
n_inner <- 300
inner_df <- data.frame(
    d1 = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_inner),
    d2 = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_inner),
    d3 = rnorm(n_inner)
)

n_outer <- 400

dt <- data.table::data.table(
    a = sample(10, n_outer, replace = TRUE),
    b = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_outer),
    c = seq.POSIXt(as.POSIXct("2019-01-01"), as.POSIXct("2020-01-01"), length.out = n_outer),
    d = rep(list(inner_df), n_outer)
)
inner_dt <- as.data.frame(inner_df) #Having data.frames in the dt
dt$d <- rep(list(inner_dt), n_outer)
do.call(bench::mark, c(f, check = FALSE))
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 tidyr        17.5ms  17.93ms     53.4    22.09MB     19.8    27    10
#2 datatable   45.52ms  50.54ms     17.2    25.59MB     27.5    10    16
#3 base       809.87ms 809.87ms      1.23    2.22GB    115.      1    93
#4 base2      290.01ms 294.97ms      3.39    1.12GB    173.      2   102
#5 dtBase       4.71ms   5.06ms    159.      10.6MB     69.4    80    35
inner_dt <- data.table::as.data.table(inner_df) #Having data.tables in the dt
dt$d <- rep(list(inner_dt), n_outer)
do.call(bench::mark, c(f, check = FALSE))
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 tidyr      285.56ms 285.94ms      3.50   28.32MB     15.7     2     9
#2 datatable   45.73ms  48.67ms     16.7     25.3MB     18.5     9    10
#3 base       784.33ms 784.33ms      1.27    2.23GB    105.      1    82
#4 base2        4.61ms   4.83ms    166.     10.62MB     50.0    83    25
#5 dtBase       4.75ms   5.02ms    158.      10.6MB     49.9    79    25

Currently it looks like using a combination of base and data.table is fastest independent if it has to work with data.frame's or data.table's.

Upvotes: 3

GKi
GKi

Reputation: 39707

Just making a benchmark showing differences with solutions for data.table and tidyr given in question another way for data.table and a base solution.

DT <- data.table::data.table(
    a = c("a1", "a2"),
    df1 = list(data.frame(
        b = c("b1", "b2")
    ))
)
n <- 1e5
set.seed(42)
dt1 <- DT[sample(seq_len(nrow(DT)), n, TRUE),]

bench::mark(check = FALSE
          , tidyr = tidyr::unnest(dt1, df1)
          , dt = dt1[, data.table::rbindlist(df1), by = .(a)]
          , dt2 = dt1[, unlist(df1, TRUE, FALSE), .(a)]
          , base = data.frame(a=rep(dt1$a, lapply(dt1$df1, nrow)), do.call(rbind, dt1$df1))
          , base2 = data.frame(a=rep(dt1$a, lapply(dt1$df1, nrow)), b=unlist(dt1$df1, TRUE, FALSE))
      )
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 tidyr         1.03s    1.03s     0.971   22.59MB     7.76     1     8
#2 dt           46.9ms  50.15ms    17.1     15.01MB     9.47     9     5
#3 dt2         11.66ms  13.66ms    70.8     14.03MB    35.4     36    18
#4 base          3.47s    3.47s     0.288   43.23MB    12.1      1    42
#5 base2       353.9ms 363.41ms     2.75     4.58MB    11.0      2     8

So data.table is here for both ways the fastest followed by one base solution, then tidyr and then the other base solution.

Upvotes: 6

Related Questions