Matthew Hui

Reputation: 664

Object sizes increase after passing them to functions

I am making some plots from large datasets. The resulting plot objects are very small, but memory usage grows by far more than their combined size.

My finding so far is that the increase in memory usage seems to be due to a few objects. In particular, the value of tab_ind does not change during the plotting process (checked with identical()), but its size increases significantly (checked with object.size()). The only thing I do with tab_ind during the process is pass it to functions as an argument.


REPRODUCIBLE EXAMPLE

The size of the simulation can be controlled by varying N. At the end of the run, the script prints the change in object sizes and the result of the identical() check on tab_ind.

library(data.table)
library(magrittr)
library(ggplot2)

N <- 6000

set.seed(runif(1, 0, .Machine$integer.max) %>% ceiling)

logit <- function(x) {return(log(x/(1-x)))}
invLogit <- function(x) {return(exp(x)/(1+exp(x)))}

tab_dat <- data.table(datasetID = seq(N), MIX_MIN_SUCCESS = sample(c(0, 1), N, replace = T), MIX_ALL = sample(c(0, 1), N, replace = T))
tab_dat[MIX_MIN_SUCCESS == 0, MIX_ALL := 0]
n <- sample(20:300, N, replace = T)
tab_ind <- data.table(
  datasetID = rep(seq(N), times = n),
  SIM_ADJ_PP1 = runif(sum(n), 0.00001, 0.99999),
  MIX_ADJ_PP1 = runif(sum(n), 0.00001, 0.99999)
)
tab_ind[, c("SIM_ADJ_LOGIT_PP1", "MIX_ADJ_LOGIT_PP1") := list(logit(SIM_ADJ_PP1), logit(MIX_ADJ_PP1))]

checkMem_gc <- function(status) {
  print(status)
  print(memory.size())
  gc()
  print(memory.size())
} 

## Individual bins for x and y
tab_by_bin_idxy <- function(dt, x, y, xNItv, yNItv, by = "quantile") {
  #Binning
  if (by == "even") {
    checkMem_gc("start x-y breaks")
    checkMem_gc("start x breaks")
    minN = dt[, min(get(x), na.rm = T)]
    checkMem_gc("after x min")
    maxN = dt[, max(get(x), na.rm = T)]
    checkMem_gc("after x max")
    xBreaks = seq(minN, maxN, length.out = xNItv + 1)
    checkMem_gc("after seq")
    checkMem_gc("after x breaks")
    yBreaks = dt[, seq(min(get(y), na.rm = T), max(get(y), na.rm = T), length.out = yNItv + 1)]
    checkMem_gc("after y breaks")
  } else if (by == "quantile") {
    xBreaks = dt[, quantile(get(x), seq(0, 1, length.out = xNItv + 1), names = F)]
    yBreaks = dt[, quantile(get(y), seq(0, 1, length.out = yNItv + 1), names = F)]
  } else {stop("type of 'by' not supported")}
  checkMem_gc("after x-y breaks")
  xbinCode = dt[, .bincode(get(x), breaks = xBreaks, include.lowest = T)]
  checkMem_gc("after x binCode")
  xbinMid = sapply(seq(xNItv), function(i) {return(mean(xBreaks[c(i, i+1)]))})[xbinCode]
  checkMem_gc("after x binMid")
  ybinCode = dt[, .bincode(get(y), breaks = yBreaks, include.lowest = T)]
  checkMem_gc("after y binCode")
  ybinMid = sapply(seq(yNItv), function(i) {return(mean(yBreaks[c(i, i+1)]))})[ybinCode]
  checkMem_gc("after y binMid")
  #Creating table
  tab_match = CJ(xbinCode = seq(xNItv), ybinCode = seq(yNItv))
  checkMem_gc("after tab match")
  tab_plot = data.table(xbinCode, xbinMid, ybinCode, ybinMid)[
    tab_match, .(xbinMid = xbinMid[1], ybinMid = ybinMid[1], N = .N), keyby = .EACHI, on = c("xbinCode", "ybinCode")
    ]
  checkMem_gc("after tab plot")
  colnames(tab_plot)[colnames(tab_plot) == "xbinCode"] = paste0(x, "_binCode")
  colnames(tab_plot)[colnames(tab_plot) == "xbinMid"] = paste0(x, "_binMid")
  colnames(tab_plot)[colnames(tab_plot) == "ybinCode"] = paste0(y, "_binCode")
  colnames(tab_plot)[colnames(tab_plot) == "ybinMid"] = paste0(y, "_binMid")
  checkMem_gc("after col name")
  rm(list = c("xBreaks", "yBreaks", "xbinCode", "ybinCode", "xbinMid", "ybinMid", "tab_match"))
  checkMem_gc("after rm")
  #Returning table
  return(tab_plot)
}

tab_by_obin_x_str_y <- function(dt, x, y, width, Nbin, by = "even") {
  #Binning
  if (by == "even") {
    xLLim = dt[, seq(min(get(x), na.rm = T), max(get(x), na.rm = T) - width, length.out = Nbin)]
    xULim = dt[, seq(min(get(x), na.rm = T) + width, max(get(x), na.rm = T), length.out = Nbin)]
  } else if (by == "quantile") {
    xLLim = dt[, quantile(get(x), seq(0, 1 - width, length.out = Nbin), names = F)]
    xULim = dt[, quantile(get(x), seq(width, 1, length.out = Nbin), names = F)]
  } else {stop("type of 'by' not supported")}
  xbinMid = (xLLim + xULim) / 2
  #summarizing y
  tab_out <- sapply(seq(Nbin), function(i) {
    dt[get(x) >= xLLim[i] & get(x) <= xULim[i], c(mean(get(y), na.rm = T), sd(get(y), na.rm = T),
                                                  quantile(get(y), c(0.025, 0.975), names = F))]
  }) %>% t %>% as.data.table %>% set_colnames(., c("mean", "sd", ".025p", ".975p")) %>%
    cbind(data.table(binCode = seq(Nbin), xLLim, xbinMid, xULim), .)
  tab_out[, c("mean_plus_1sd", "mean_minus_1sd") := list(mean + sd, mean - sd)]
  return(tab_out)
}

plotEnv <- new.env()
backupEnv <- new.env()

gc()
gc()
checkMem_gc("Starting memory size checking")
start.mem.size <- memory.size()
start_ObjSizes <- sapply(ls(), function(x) {object.size(get(x))})
start_tab_ind <- tab_ind
start_tab_ind_size <- object.size(tab_ind)
dummyEnv <- new.env()
with(dummyEnv, {
  ## Set function for analyses against SIM_PP1
  fcn_SIM_PP1 <- function(dt, newTab = T) {
    dat_prob = tab_by_bin_idxy(dt, x = "SIM_ADJ_PP1", y = "MIX_ADJ_PP1", xNItv = 50, yNItv = 50, by = "even")
    checkMem_gc("after tab prob")
    dat_logit = tab_by_bin_idxy(dt, x = "SIM_ADJ_LOGIT_PP1", y = "MIX_ADJ_LOGIT_PP1",
                                xNItv = 50, yNItv = 50, by = "even")
    checkMem_gc("after tab logit")

    if ((!newTab) && exists("summarytab_logit_SIM_ADJ_PP1", where = backupEnv) && 
        exists("summarytab_prob_SIM_ADJ_PP1", where = backupEnv)) {
      summarytab_logit = get("summarytab_logit_SIM_ADJ_PP1", envir = backupEnv)
      summarytab_prob = get("summarytab_prob_SIM_ADJ_PP1", envir = backupEnv)
    } else {
      summarytab_logit = tab_by_obin_x_str_y(dt, x = "SIM_ADJ_LOGIT_PP1", y = "MIX_ADJ_LOGIT_PP1",
                                             width = 0.05, Nbin = 1000, by = "even") 
      summarytab_prob = summarytab_logit[, .(
        binCode, invLogit(xLLim), invLogit(xbinMid), invLogit(xULim), invLogit(mean), sd,
        invLogit(`.025p`), invLogit(`.975p`), invLogit(mean_plus_1sd), invLogit(mean_minus_1sd)
      )] %>% set_colnames(colnames(summarytab_logit))
      assign("summarytab_logit_SIM_ADJ_PP1", summarytab_logit, envir = backupEnv)
      assign("summarytab_prob_SIM_ADJ_PP1", summarytab_prob, envir = backupEnv)
    }
    checkMem_gc("after summary tab")

    plot_prob <- ggplot(dat_prob, aes(x = SIM_ADJ_PP1_binMid)) +
      geom_vline(xintercept = 1, linetype = "dotted") +
      geom_hline(yintercept = 1, linetype = "dotted") +
      geom_abline(slope = 1, intercept = 0, size = 1.5, linetype = "dashed", alpha = 0.5) +
      geom_point(aes(y = MIX_ADJ_PP1_binMid, size = N), alpha = 0.5, na.rm = T) +
      geom_line(data = summarytab_prob, aes(x = xbinMid, y = mean), size = 1.25, color = "black", na.rm = T) +
      geom_line(data = summarytab_prob, aes(x = xbinMid, y = mean_plus_1sd), size = 1.25, color = "blue", na.rm = T, linetype = "dashed") +
      geom_line(data = summarytab_prob, aes(x = xbinMid, y = mean_minus_1sd), size = 1.25, color = "blue", na.rm = T, linetype = "dashed") +
      scale_size_continuous(range = c(0.5, 5)) +
      scale_x_continuous(name = "Simulated PP", breaks = seq(0, 1, 0.25),
                         labels = c("0%", "25%", "50%", "75%", "100%")) +
      scale_y_continuous(name = "Estimated PP", limits = c(0, 1), breaks = seq(0, 1, 0.25),
                         labels = c("0%", "25%", "50%", "75%", "100%")) +
      theme_classic() +
      theme(axis.title = element_text(size = 18),
            axis.text = element_text(size = 16))

    checkMem_gc("after plot prob")
    rm(dat_prob)
    rm(summarytab_prob)
    checkMem_gc("after removing dat_prob and summary_prob")

    plot_logit <- ggplot(dat_logit, aes(x = SIM_ADJ_LOGIT_PP1_binMid)) +
      geom_abline(slope = 1, intercept = 0, size = 1.5, linetype = "dashed", alpha = 0.5) +
      geom_point(aes(y = MIX_ADJ_LOGIT_PP1_binMid, size = N), alpha = 0.5, na.rm = T) +
      geom_line(data = summarytab_logit, aes(x = xbinMid, y = mean), size = 1.25, color = "black", na.rm = T) +
      geom_line(data = summarytab_logit, aes(x = xbinMid, y = mean_plus_1sd), size = 1.25, color = "blue", na.rm = T, linetype = "dashed") +
      geom_line(data = summarytab_logit, aes(x = xbinMid, y = mean_minus_1sd), size = 1.25, color = "blue", na.rm = T, linetype = "dashed") +
      scale_size_continuous(range = c(0.5, 5)) +
      scale_x_continuous(name = "Simulated LOGIT PP1",
                         breaks = c(0.00001, 0.001, 0.05, 0.5, 0.95, 0.999, 0.99999) %>% logit,
                         labels = c("0.001%", "0.1%", "5%", "50%", "95%", "99.9%", "99.999%")) +
      scale_y_continuous(name = "Estimated LOGIT PP1", limits = c(-12, 12),
                         breaks = c(0.00001, 0.001, 0.05, 0.5, 0.95, 0.999, 0.99999) %>% logit,
                         labels = c("0.001%", "0.1%", "5%", "50%", "95%", "99.9%", "99.999%")) +
      theme_classic() +
      theme(axis.title = element_text(size = 18),
            axis.text = element_text(size = 16))

    checkMem_gc("after plot logit")
    rm(summarytab_logit)
    rm(dat_logit)
    checkMem_gc("after removing dat_logit and summary_logit")

    return(list(plot_prob, plot_logit))
  }

  checkMem_gc("after defining function")

  ## Tabling

  tab_stat <- tab_ind[, c("MIX_MIN_SUCCESS", "MIX_ALL") := list(
    tab_dat[tab_ind[, datasetID], MIX_MIN_SUCCESS],
    tab_dat[tab_ind[, datasetID], MIX_ALL]
  )]
  checkMem_gc("after new tab_stat")

  tab_stat_MIN_SUCCESS <- tab_stat[MIX_MIN_SUCCESS == 1]
  checkMem_gc("after new new tab_stat_MIN_SUCCESS")

  tab_stat_MIX_ALL <- tab_stat[MIX_ALL == 1]
  checkMem_gc("after new tab_stat_MIX_ALL")

  # Generating ggplot objects
  print("--- start lst full ---")
  lst_full <- fcn_SIM_PP1(tab_stat, newTab = F)
  checkMem_gc("after lst full")
  rm(tab_stat)
  checkMem_gc("after rm tab_stat")

  print("--- start lst MIN_SUCCESS ---")
  lst_MIN_SUCCESS <- fcn_SIM_PP1(tab_stat_MIN_SUCCESS, newTab = F)
  checkMem_gc("after lst MIN_SUCCESS")
  rm(tab_stat_MIN_SUCCESS)
  checkMem_gc("after rm tab_MIN_SUCCESS")

  print("--- start lst MIX_ALL ---")
  lst_MIX_ALL <- fcn_SIM_PP1(tab_stat_MIX_ALL, newTab = F)
  checkMem_gc("after lst MIX_ALL")
  rm(tab_stat_MIX_ALL)
  checkMem_gc("after rm tab_stat_MIX_ALL")

  ## Start plotting
  print("--- Start plotting ---")
  assign("full_sp_MIX_ADJ_PP1_vs_SIM_ADJ_PP1", lst_full[[1]], envir = plotEnv)
  checkMem_gc("after assign1")
  assign("full_sp_MIX_ADJ_LOGIT_PP1_vs_SIM_ADJ_LOGIT_PP1", lst_full[[2]], envir = plotEnv)
  checkMem_gc("after assign2")
  rm(lst_full)
  checkMem_gc("after removing lst_full")
  assign("MIN_SUCCESS_sp_MIX_ADJ_PP1_vs_SIM_ADJ_PP1", lst_MIN_SUCCESS[[1]], envir = plotEnv)
  checkMem_gc("after assign3")
  assign("MIN_SUCCESS_sp_MIX_ADJ_LOGIT_PP1_vs_SIM_ADJ_LOGIT_PP1", lst_MIN_SUCCESS[[2]], envir = plotEnv)
  checkMem_gc("after assign4")
  rm(lst_MIN_SUCCESS)
  checkMem_gc("after removing lst_MIN_SUCCESS")
  assign("MIX_ALL_sp_MIX_ADJ_PP1_vs_SIM_ADJ_PP1", lst_MIX_ALL[[1]], envir = plotEnv)
  checkMem_gc("after assign5")
  assign("MIX_ALL_sp_MIX_ADJ_LOGIT_PP1_vs_SIM_ADJ_LOGIT_PP1", lst_MIX_ALL[[2]], envir = plotEnv)
  checkMem_gc("after assign6")
  rm(lst_MIX_ALL)
  checkMem_gc("after removing lst_MIX_ALL")
})

checkMem_gc("--- Finishing ---")
rm(dummyEnv)
gc()
checkMem_gc("After clean up")
final.mem.size <- memory.size()
end_ObjSizes <- sapply(ls(), function(x) {object.size(get(x))})
print("")
print("")
print("--- The sizes of all objects (under .GlobalEnv) BEFORE the graph plotting process ---")
print("--- (Before the process starts, all existing objects are stored under .GlobalEnv) ---")
print(start_ObjSizes)
print("")
print("--- The sizes of all objects (under .GlobalEnv) AFTER the graph plotting process ---")
print(end_ObjSizes)
print("--- I have not altered any existing objects under .GlobalEnv during the process, I only passed them to functions. And yet their sizes increase! ---")
print("--- Let's look at the object tab_ind, which shows the largest inflation in object size ---")
print("--- This is the size of tab_ind BEFORE the process: ---")
print(start_tab_ind_size)
print("--- This is the size of tab_ind AFTER the process: ---")
print(object.size(tab_ind))
print("--- But they are identical (checked using the function identical())! ---")
print(identical(start_tab_ind, tab_ind))
print("")

UPDATED REPRODUCIBLE EXAMPLE

This is an updated, simpler reproducible example. The latest finding is that to make a copy of a data.table object, data.table::copy() should be used instead of a plain <-. The latter only creates another reference to the same object. Modifying the object through the new name (for example with :=) also changes the original, which is why the size of the original inflated when I changed the new one. I am not sure, however, whether this is the only source of the memory usage inflation.
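To illustrate the <- versus copy() point in isolation, here is a minimal sketch on a toy table. The names dt, alias and independent are made up for this illustration and do not appear in my script.

```r
library(data.table)

dt <- data.table(id = 1:5, x = runif(5))
size_before <- as.numeric(object.size(dt))

alias <- dt             # plain assignment: both names refer to the same data.table
alias[, y := x * 2]     # := adds the column by reference, so dt gains it too

has_y  <- "y" %in% names(dt)                       # TRUE
grew   <- as.numeric(object.size(dt)) > size_before  # TRUE: dt now carries the extra column
same   <- identical(alias, dt)                     # TRUE: they are one and the same object

independent <- copy(dt) # a real copy with its own columns
independent[, z := 0]   # only the copy is modified
has_z  <- "z" %in% names(dt)                       # FALSE

print(c(has_y = has_y, grew = grew, same = same, has_z = has_z))
```

This matches what I saw with tab_ind: identical() reports TRUE because the "backup" and the original are the same object, while object.size() reports the larger size because of the added columns.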

library(data.table)
library(magrittr)
library(ggplot2)

N <- 6000

set.seed(runif(1, 0, .Machine$integer.max) %>% ceiling)

logit <- function(x) {return(log(x/(1-x)))}
invLogit <- function(x) {return(exp(x)/(1+exp(x)))}

tab_dat <- data.table(datasetID = seq(N), MIX_MIN_SUCCESS = sample(c(0, 1), N, replace = T), MIX_ALL = sample(c(0, 1), N, replace = T))
tab_dat[MIX_MIN_SUCCESS == 0, MIX_ALL := 0]
n <- sample(20:300, N, replace = T)
tab_ind <- data.table(
  datasetID = rep(seq(N), times = n),
  SIM_ADJ_PP1 = runif(sum(n), 0.00001, 0.99999),
  MIX_ADJ_PP1 = runif(sum(n), 0.00001, 0.99999)
)

## Individual bins for x and y
tab_by_bin_idxy <- function(dt, x, y, xNItv, yNItv, by = "quantile") {
  #Binning
  if (by == "even") {
    minN = dt[, min(get(x), na.rm = T)]
    maxN = dt[, max(get(x), na.rm = T)]
    xBreaks = seq(minN, maxN, length.out = xNItv + 1)
    yBreaks = dt[, seq(min(get(y), na.rm = T), max(get(y), na.rm = T), length.out = yNItv + 1)]
  } else if (by == "quantile") {
    xBreaks = dt[, quantile(get(x), seq(0, 1, length.out = xNItv + 1), names = F)]
    yBreaks = dt[, quantile(get(y), seq(0, 1, length.out = yNItv + 1), names = F)]
  }
  xbinCode = dt[, .bincode(get(x), breaks = xBreaks, include.lowest = T)]
  xbinMid = sapply(seq(xNItv), function(i) {return(mean(xBreaks[c(i, i+1)]))})[xbinCode]
  ybinCode = dt[, .bincode(get(y), breaks = yBreaks, include.lowest = T)]
  ybinMid = sapply(seq(yNItv), function(i) {return(mean(yBreaks[c(i, i+1)]))})[ybinCode]
  #Creating table
  tab_match = CJ(xbinCode = seq(xNItv), ybinCode = seq(yNItv))
  tab_plot = data.table(xbinCode, xbinMid, ybinCode, ybinMid)[
    tab_match, .(xbinMid = xbinMid[1], ybinMid = ybinMid[1], N = .N), keyby = .EACHI, on = c("xbinCode", "ybinCode")
    ]
  colnames(tab_plot)[colnames(tab_plot) == "xbinCode"] = paste0(x, "_binCode")
  colnames(tab_plot)[colnames(tab_plot) == "xbinMid"] = paste0(x, "_binMid")
  colnames(tab_plot)[colnames(tab_plot) == "ybinCode"] = paste0(y, "_binCode")
  colnames(tab_plot)[colnames(tab_plot) == "ybinMid"] = paste0(y, "_binMid")
  rm(list = c("xBreaks", "yBreaks", "xbinCode", "ybinCode", "xbinMid", "ybinMid", "tab_match"))
  #Returning table
  return(tab_plot)
}

plotEnv <- new.env()
backupEnv <- new.env()

gc()
gc(verbose = T)
start.mem.size <- memory.size()
start_ObjSizes <- sapply(ls(), function(x) {object.size(get(x))})
start_tab_ind <- copy(tab_ind)
start_tab_ind_size <- object.size(tab_ind)
dummyEnv <- new.env()
with(dummyEnv, {
  ## Set function for analyses against SIM_PP1
  fcn_SIM_PP1 <- function(dt, newTab = T) {
    dat_prob = tab_by_bin_idxy(dt, x = "SIM_ADJ_PP1", y = "MIX_ADJ_PP1", xNItv = 50, yNItv = 50, by = "even")

    plot_prob <- ggplot(dat_prob, aes(x = SIM_ADJ_PP1_binMid)) +
      geom_vline(xintercept = 1, linetype = "dotted") +
      geom_hline(yintercept = 1, linetype = "dotted") +
      geom_abline(slope = 1, intercept = 0, size = 1.5, linetype = "dashed", alpha = 0.5) +
      geom_point(aes(y = MIX_ADJ_PP1_binMid, size = N), alpha = 0.5, na.rm = T) +
      scale_size_continuous(range = c(0.5, 5)) +
      scale_x_continuous(name = "Simulated PP", breaks = seq(0, 1, 0.25),
                         labels = c("0%", "25%", "50%", "75%", "100%")) +
      scale_y_continuous(name = "Estimated PP", limits = c(0, 1), breaks = seq(0, 1, 0.25),
                         labels = c("0%", "25%", "50%", "75%", "100%")) +
      theme_classic() +
      theme(axis.title = element_text(size = 18),
            axis.text = element_text(size = 16))

    return(plot_prob)
  }

  ## Tabling
  tab_stat <- copy(tab_ind)
  tab_stat <- tab_stat[, c("MIX_MIN_SUCCESS", "MIX_ALL") := list(
    tab_dat[tab_stat[, datasetID], MIX_MIN_SUCCESS],
    tab_dat[tab_stat[, datasetID], MIX_ALL]
  )]

  tab_stat_MIN_SUCCESS <- tab_stat[MIX_MIN_SUCCESS == 1]

  tab_stat_MIX_ALL <- tab_stat[MIX_ALL == 1]

  # Generating ggplot objects
  lst_full <- fcn_SIM_PP1(tab_stat, newTab = F)
  lst_MIN_SUCCESS <- fcn_SIM_PP1(tab_stat_MIN_SUCCESS, newTab = F)
  lst_MIX_ALL <- fcn_SIM_PP1(tab_stat_MIX_ALL, newTab = F)

  ## Start plotting
  assign("full_sp_MIX_ADJ_PP1_vs_SIM_ADJ_PP1", lst_full, envir = plotEnv)
  assign("MIN_SUCCESS_sp_MIX_ADJ_PP1_vs_SIM_ADJ_PP1", lst_MIN_SUCCESS, envir = plotEnv)
  assign("MIX_ALL_sp_MIX_ADJ_PP1_vs_SIM_ADJ_PP1", lst_MIX_ALL, envir = plotEnv)
})

rm(dummyEnv)
rm(start_tab_ind)
gc(verbose = T)
final.mem.size <- memory.size()
end_ObjSizes <- sapply(ls(), function(x) {object.size(get(x))})

My sessionInfo() when running the above example:

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
  [1] LC_COLLATE=English_Hong Kong SAR.1252  LC_CTYPE=English_Hong Kong SAR.1252    LC_MONETARY=English_Hong Kong SAR.1252
  [4] LC_NUMERIC=C                           LC_TIME=English_Hong Kong SAR.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] ggplot2_2.2.1     magrittr_1.5      data.table_1.11.4

loaded via a namespace (and not attached):
  [1] colorspace_1.3-2 scales_0.5.0     compiler_3.5.0   lazyeval_0.2.1   plyr_1.8.4       tools_3.5.0      pillar_1.2.3     gtable_0.2.0    
  [9] tibble_1.4.2     yaml_2.1.19      Rcpp_0.12.18     grid_3.5.0       rlang_0.2.1      munsell_0.4.3   

Upvotes: 3

Views: 481

Answers (1)

Technophobe01

Reputation: 8676

My sense is that you need to increase --min-vsize. Why? The error "cannot allocate vector of size ..." implies that you need to increase --min-vsize.

R Command Line Invocation:

R --min-vsize=400M

RStudio Invocation

Create or add an entry to your .Renviron file.

R_VSIZE=400M

Ref: Friendly R Startup Configuration
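As a quick check that the .Renviron entry was picked up, something along these lines can be run in a fresh session. It is only a sketch: it reads the environment variable and says nothing about a value passed with --min-vsize on the command line. The variable name vsize is just a placeholder.

```r
# Read R_VSIZE from the environment; NA means it was not set for this session
vsize <- Sys.getenv("R_VSIZE", unset = NA)

if (is.na(vsize)) {
  message("R_VSIZE is not set; R is using its built-in default")
} else {
  message("R_VSIZE = ", vsize)
}
```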

Key Questions:

  • Are you running a 64bit OS? [Yes/No]
  • Are you running a 64bit version of R? [Yes/No]

If you answer "No" to either of these questions, I'd recommend you upgrade.

Background

The reality here is that if you need to increase the minimum vsize, you likely want to look at your code for assignment gotchas. In most cases, you'll find that you are duplicating data through assignments that end up copying it.
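A minimal sketch of when base R actually duplicates a vector, assuming data.table is installed (it is used here only for its address() helper). The variable names are illustrative.

```r
library(data.table)  # only for address(), which reports where an object lives

x <- c(1, 2, 3)
y <- x                                    # no copy yet: both names share one vector
shared_before <- address(x) == address(y) # TRUE

y[1] <- 10                                # first modification forces a duplicate
shared_after <- address(x) == address(y)  # FALSE

print(shared_before)
print(shared_after)
print(x)                                  # x still holds 1 2 3
```

With large vectors, each such modification of a shared object allocates a full second copy, which is the kind of hidden duplication worth hunting for before raising any memory settings.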

For more information on R gotchas, I highly recommend you read:

The detail behind it all.

R maintains separate areas for fixed and variable sized objects. The first of these is allocated as an array of cons cells (Lisp programmers will know what they are, others may think of them as the building blocks of the language itself, parse trees, etc.), and the second are thrown on a heap of ‘Vcells’ of 8 bytes each. Each cons cell occupies 28 bytes on a 32-bit build of R, (usually) 56 bytes on a 64-bit build.

The default values are (currently) an initial setting of 350k cons cells and 6Mb of vector heap. Note that the areas are not actually allocated initially: rather these values are the sizes for triggering garbage collection. These values can be set by the command line options --min-nsize and --min-vsize (or if they are not used, the environment variables R_NSIZE and R_VSIZE) when R is started. Thereafter R will grow or shrink the areas depending on usage, never decreasing below the initial values. The maximal vector heap size can be set with the environment variable R_MAX_VSIZE.

How much time R spends in the garbage collector will depend on these initial settings and on the trade-off the memory manager makes, when memory fills up, between collecting garbage to free up unused memory and growing these areas. The strategy used for growth can be specified by setting the environment variable R_GC_MEM_GROW to an integer value between 0 and 3. This variable is read at start-up. Higher values grow the heap more aggressively, thus reducing garbage collection time but using more memory.

Ref: https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/Memory
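To see these two areas in a running session, gc() returns a small matrix with one row for Ncells and one for Vcells. The sketch below converts the Vcells count to megabytes using the 8 bytes per Vcell quoted above; the result should agree with the "(Mb)" column up to rounding. The names mem, vcells_used and vcell_mb are illustrative.

```r
mem <- gc()            # runs a collection and returns the usage matrix
print(mem)

# Vcells are 8 bytes each, so the count converts to megabytes like this
vcells_used <- mem["Vcells", "used"]
vcell_mb <- vcells_used * 8 / 1024^2
print(round(vcell_mb, 1))

# The "gc trigger" column is the threshold at which the next collection starts
print(mem[, "gc trigger"])
```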

Windows

The address-space limit is 2Gb under 32-bit Windows unless the OS's default has been changed to allow more (up to 3Gb). See https://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx and https://msdn.microsoft.com/en-us/library/bb613473(VS.85).aspx. Under most 64-bit versions of Windows the limit for a 32-bit build of R is 4Gb: for the oldest ones it is 2Gb. The limit for a 64-bit build of R (imposed by the OS) is 8Tb.

It is not normally possible to allocate as much as 2Gb to a single vector in a 32-bit build of R even on 64-bit Windows because of preallocations by Windows in the middle of the address space.

Under Windows, R imposes limits on the total memory allocation available to a single session as the OS provides no way to do so: see memory.size and memory.limit.

Upvotes: 5
