How to optimize memory usage in dplyr + purrr

Question

I have an issue where after replicating data for a training and testing set, I'm showing a large amount of memory allocated to my user in Rstudio, but not being used in my R session. I've created a small example to reproduce my situation :)

This code runs a bunch of model, based on different formulas, algorithms, and parameter sets that I give it. It is a function, but I've created a simple script for reprex.

library(dplyr)
library(purrr)
library(modelr)
library(tidyr)
library(pryr)

# set my inputs
data <- mtcars
formulas <- c(test1 = mpg ~ cyl + wt + hp,
              test2 = mpg ~ cyl + wt)
params = list()
methods <- "lm"

n <- 20 # num of cv splits
mult <- 10 # number of times I want to replicate some of the data
frac <- .25 # how much I want to cut down other data (fractional)

### the next few chunks get the unique combos of the inputs.
if (length(params) != 0) {
  cross_params <- params %>% 
    map(cross) %>% 
    map_df(enframe, name = "param_set", .id = "method") %>% 
    list
} else cross_params <- NULL

methods_df <- tibble(method = methods) %>% 
  list %>% 
  append(cross_params)  %>% 
  reduce(left_join, by = "method") %>% 
  split(1:nrow(.))

# wrangle formulas into a split dataframe
formulas_df <- tibble(formula = formulas,
                      name = names(formulas)) %>% 
  split(.$name)

# split out the data into n random train-test combos
cv_data <- data %>% 
  crossv_kfold(n) %>% # rsample?
  mutate_at(vars(train:test), ~map(.x, as_tibble))

# sample out if needed
cv_data_samp <- cv_data %>%
  mutate(train = modify(train, 
                        ~ .x %>% 
                          split(.$gear == 4) %>% 
                          # take a sample of the non-vo data
                          modify_at("FALSE", sample_frac, frac) %>% 
                          # multiply out the vo-on data
                          modify_at("TRUE", function(.df) {
                            map_df(seq_along(1:mult), ~ .df) 
                          }) %>% 
                          bind_rows))

# get all unique combos of formula and method
model_combos <- list(cv = list(cv_data_samp), 
                     form = formulas_df, 
                     meth = methods_df) %>% 
  cross %>%
  map_df(~ bind_cols(nest(.x$cv), .x$form, .x$meth)) %>% 
  unnest(data, .preserve = matches("formula|param|value")) %>% 
  {if ("value" %in% names(.)) . else mutate(., value = list(NULL))} 

# run the models
model_combos %>% 
  # put all arguments into a single params column
  mutate(params = pmap(list(formula = formula, data = train), list)) %>%
  mutate(params = map2(params, value, ~ append(.x, .y))) %>%
  mutate(params = modify(params, discard, is.null)) %>%
  # run the models
  mutate(model = invoke_map(method, params))  

mem_change(rm(data, cv_data, cv_data_samp))
mem_used()

Now after I do this, my mem_used comes out to 77.3mb, but I'm seeing roughly double that (160Mb) allocated to my R user. This really explodes when my data is 3 Gb, which is my real-life case. I end up using 100Gb and tying up a whole server :(.

What is going on and how can I optimize?

Any help appreciated!!!

Zafar · Accepted Answer

I figured this out! The issue was that I was converting my series of modelr resample objects to tibbles and that was exploding the memory even though I was subsequently sampling them down. The solution? Write methods for dealing with resample objects so that I never have to convert resample objects to tibble. These looked like:

# this function just samples the indexes instead of the data
sample_frac.resample <- function(data, frac) {
  data$idx <- sample(data$idx, frac * length(data$idx))
  data
}

# this function replicates the indexes. I should probably call it something else.
augment.resample <- function(data, n) {
  data$idx <- unlist(map(seq_along(1:n), ~ data$idx))
  data
}

# This function does simple splitting (logical only) of resample obejcts
split.resample <- function(data, .p) {
  pos <- list(data = data$data, idx = which(.p, 1:nrow(data$data)))
  neg <- list(data = data$data, idx = which(!.p, 1:nrow(data$data)))
  class(pos) <- "resample"
  class(neg) <- "resample"
  list("TRUE" = pos,
       "FALSE" = neg)
}

# This function takes the equivalent of a `bind_rows` for resample objects.
# Since bind rows does not call `useMethod` I had to call it something else
bind <- function(data) {
  out <- list(data = data[[1]]$data, idx = unlist(map(data, pluck, "idx")))
  class(out) <- "resample"
  out
}

Then I just converted to a tibble in the same purrr closure in which my model is run for that CV. Problem solved! My memory usage is VERY low now.

How to optimize memory usage in dplyr + purrr

Answers (1)

Related Questions