Efficient random sampling in R

Question

From a data frame, I am trying to randomly sample 1 to 20 observations, and for each sample size I would like to replicate the process 4 times. I came up with this working solution, but it is very slow because the crossing() function copies a large data frame many times. Can anyone point me toward a more efficient solution?

library(tidyverse)

mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 
  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {

    data %>%
      sample_n(n, replace = TRUE) %>%
      summarise(mean_mpg = mean(mpg)) %>%
      pull(mean_mpg)

  }))
#> # A tibble: 240 x 5
#>      cyl data              n_random_sample n_replicate   res
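#>    <dbl> <list>                      <int>       <int> <dbl>
#>  1     6 <tibble [7 x 10]>               1           1  17.8
#>  2     6 <tibble [7 x 10]>               1           2  21
#>  3     6 <tibble [7 x 10]>               1           3  19.2
#>  4     6 <tibble [7 x 10]>               1           4  18.1
#>  5     6 <tibble [7 x 10]>               2           1  19.6
#>  6     6 <tibble [7 x 10]>               2           2  19.4
#>  7     6 <tibble [7 x 10]>               2           3  19.6
#>  8     6 <tibble [7 x 10]>               2           4  20.4
#>  9     6 <tibble [7 x 10]>               3           1  20.1
#> 10     6 <tibble [7 x 10]>               3           2  18.9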
#> # ... with 230 more rows

Created on 2018-11-19 by the reprex package (v0.2.1)

EDIT: I am now working with a much larger dataset. Would it be possible to do it more efficiently with data.table?

AntoniosK · Accepted Answer

This alternative subsets the original dataset inside a function and samples rows from that subset, instead of using nest() to store the sub-datasets in a list column and then sampling from each one with map():

library(tidyverse)

# create function to sample rows
f = function(c, n) {
  mtcars %>%
    filter(cyl == c) %>%
    sample_n(n, replace = TRUE) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    pull(mean_mpg)
}

# vectorise function
f = Vectorize(f)

# set seed for reproducibility
set.seed(11)

as_tibble(mtcars) %>%
  distinct(cyl) %>%
  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
  mutate(res = f(cyl, n_random_sample))

# # A tibble: 240 x 4
#     cyl n_random_sample n_replicate   res
#   <dbl>           <int>       <int> <dbl>
# 1     6               1           1  21  
# 2     6               1           2  21  
# 3     6               1           3  18.1
# 4     6               1           4  21  
# 5     6               2           1  20.4
# 6     6               2           2  21.2
# 7     6               2           3  20.4
# 8     6               2           4  19.6
# 9     6               3           1  18.4
#10     6               3           2  19.6
# # ... with 230 more rows
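
Regarding the edit about data.table: the following is only a rough sketch of how the same idea might look there, not a benchmarked solution. It assumes the data.table package is installed; the names dt, grid and mpg_by_cyl are made up for illustration. The grid of combinations is built from the key columns alone with CJ(), and mpg is split by cyl once, so each draw indexes a plain numeric vector rather than filtering the data frame again.

```r
library(data.table)

# set seed for reproducibility
set.seed(11)

dt <- as.data.table(mtcars)

# one row per (cyl, sample size, replicate) combination;
# only the key columns are crossed, not the data itself
grid <- CJ(cyl = unique(dt$cyl), n_random_sample = 1:20, n_replicate = 1:4)

# split mpg by cyl once (hypothetical helper object)
mpg_by_cyl <- split(dt$mpg, dt$cyl)

# for each grid row, draw n values with replacement and average them
grid[, res := mapply(
  function(cyl_value, n) {
    x <- mpg_by_cyl[[as.character(cyl_value)]]
    mean(x[sample.int(length(x), n, replace = TRUE)])
  },
  cyl, n_random_sample
)]

grid
```

Indexing with sample.int() rather than calling sample() on the values directly avoids the surprise that sample(x, n) treats a length-one numeric x as 1:x. Whether this is actually faster on your data is something you would need to time yourself.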
