danilinares

Reputation: 1270

Programming using the tidyverse: speed issues

We released the package quickpsy a few years ago (paper in The R Journal). The package used base R functions but also made extensive use of functions from what was then called the Hadleyverse. We are now developing a new version of the package that mostly uses tidyverse functions and incorporates the new non-standard evaluation approach, and we found that it is more than four times slower. For example, we found that purrr::map is much slower than dplyr::do (which is deprecated):

library(tidyverse)

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    do(head(., 2))
)

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    nest() %>% 
    mutate(temp = map(data, ~head(., 2))) %>% 
    unnest(temp)
)

We also found that functions like pull are very slow.
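A minimal sketch of how the pull() claim could be checked, assuming the comparison of interest is against base R's `[[` extraction. A single call is too fast to time reliably, so the calls are repeated; `n_reps` is a hypothetical repetition count chosen only for illustration, and the timings will vary by machine and dplyr version.

```r
library(dplyr, warn.conflicts = FALSE)

# Both calls extract the same column as a plain numeric vector.
same_result <- identical(pull(mtcars, mpg), mtcars[["mpg"]])

# Repeat each call many times so the elapsed time is measurable.
n_reps <- 1000  # hypothetical repetition count, not from the original post
time_pull <- system.time(for (i in seq_len(n_reps)) pull(mtcars, mpg))
time_base <- system.time(for (i in seq_len(n_reps)) mtcars[["mpg"]])

# Total elapsed seconds for n_reps calls of each approach.
c(pull = time_pull[["elapsed"]], base = time_base[["elapsed"]])
```

If pull() turns out to dominate the run time inside a tight loop, swapping in `[[` is one possible workaround, at the cost of losing tidy-select column specification.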

We are not sure whether the tidyverse is simply not meant for this type of programming or whether we are not using it properly.

Upvotes: 4

Views: 1246

Answers (2)

danilinares

Reputation: 1270

For this particular example, the slowness caused by the nest and unnest computations can be avoided by using group_modify:

system.time(
  mtcars %>%
    group_by(cyl) %>%
    group_modify(~head(., 2))
)
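As a quick sanity check on what group_modify returns here, the sketch below stores the result and counts rows per group. It assumes a dplyr version that provides group_modify; the name `first_two` is made up for this illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Apply head() to each cyl group; .x is the group's rows without the
# grouping column, which group_modify adds back as the first column.
first_two <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ head(.x, 2))

# Each of the three cyl groups should contribute two rows.
rows_per_group <- first_two %>%
  ungroup() %>%
  count(cyl)
rows_per_group
```

The result stays grouped by cyl, so an ungroup() may be needed before further processing that should ignore the grouping.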

Upvotes: 0

Romain Francois

Reputation: 17642

slice() is the proper tool to use if you want the first two rows of each group. Both do() and nest() %>% mutate(map()) %>% unnest() are too heavy and use more memory:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    do(head(., 2))
)
#>    user  system elapsed 
#>   0.065   0.003   0.075

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    nest() %>% 
    mutate(temp = map(data, ~head(., 2))) %>% 
    unnest(temp)
)
#>    user  system elapsed 
#>   0.024   0.000   0.024

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    slice(1:2)
)
#>    user  system elapsed 
#>   0.002   0.000   0.002

Created on 2018-10-23 by the reprex package (v0.2.1.9000)

See also benchmark results in this tidyr issue
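Single system.time() calls on a 32-row data frame are noisy, so a rough way to compare the approaches is to repeat each one. The sketch below is only an illustration: `time_reps` is a hypothetical helper written for this example, the repetition count is arbitrary, and slice_head() is included on the assumption that dplyr 1.0.0 or later is installed. Absolute numbers will differ across machines and package versions.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)

# Hypothetical helper: total elapsed seconds for n_reps calls of f.
time_reps <- function(f, n_reps = 50) {
  system.time(for (i in seq_len(n_reps)) f())[["elapsed"]]
}

by_cyl <- group_by(mtcars, cyl)

timings <- c(
  do = time_reps(function() do(by_cyl, head(., 2))),
  nest_map = time_reps(function() {
    by_cyl %>%
      nest() %>%
      mutate(temp = map(data, ~ head(., 2))) %>%
      unnest(temp)
  }),
  slice = time_reps(function() slice(by_cyl, 1:2)),
  # slice_head() assumes dplyr >= 1.0.0.
  slice_head = time_reps(function() slice_head(by_cyl, n = 2))
)
timings
```

For more careful measurements, a dedicated benchmarking package would be a better fit than repeated system.time() calls.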

Upvotes: 3
