Reputation: 181
So the idea is to loop through a dataset of 50 million rows, reading one million observations per iteration, taking a 1% random sample of each million-row chunk, and storing that sample in a list. In the end this should yield around 500k rows, i.e. 1% of 50 million. Unfortunately, R's memory consumption keeps growing as the iterations progress. Am I using the rm() and gc() functions incorrectly to remove the objects? I can't tell where the memory is going.
library(data.table)
set.seed(123)
iter <- seq(1000000, 47000000, 1000000)
datalist <- vector("list", length(iter))  # the result list must exist before the loop
j <- 1
for (i in iter) {
  # read the next chunk of one million rows
  train <- read.csv(file = 'train.csv', nrows = 1000000, skip = i, header = FALSE)
  # draw a 1% random sample of the chunk
  smp_size <- floor(.01 * nrow(train))
  train_ind <- sample(seq_len(nrow(train)), size = smp_size)
  train <- train[train_ind, ]
  datalist[[j]] <- train
  j <- j + 1
  # try to release the chunk before the next iteration
  rm(train, train_ind, smp_size)
  gc()
}
newtrain <- rbindlist(datalist)
Upvotes: 2
Views: 1200
Reputation: 181
I have solved the problem. The code itself is fine; the issue was that the OS was not releasing the memory R frees, whether through gc() or when objects go out of scope. This has been discussed on this site, but only infrequently and vaguely, and in general it is hard to find much information about this issue on the internet.
Switching from Windows XP to Windows 7 and then finally to Linux solved the problem. On Linux, malloc_trim() can additionally be used to ask the allocator to return freed memory to the OS (mallinfo() only reports allocator statistics; it does not release anything).
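For anyone who wants to try this, here is a minimal sketch of exposing malloc_trim() to R via a tiny C shim. It assumes a Linux/glibc system with R's build tools installed; the file name trim.c and the wrapper name r_malloc_trim are illustrative, not an established API:

# sketch: compile a one-function C shim around glibc's malloc_trim()
writeLines(c(
  "#include <malloc.h>",
  "#include <Rinternals.h>",
  "SEXP r_malloc_trim(void) { malloc_trim(0); return R_NilValue; }"
), "trim.c")
system("R CMD SHLIB trim.c")       # compiles trim.c to trim.so
dyn.load("trim.so")
invisible(.Call("r_malloc_trim"))  # ask glibc to return freed pages to the OS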
Upvotes: 0
Reputation: 522762
The memory growth you are seeing as your script's for loop progresses is most likely due to the overhead of your list datalist combined with the sheer volume of data passing through the loop, roughly 50 million rows in total.
Here is your call to read.csv:
train <- read.csv(file='train.csv', nrows = 1000000, skip=i, header=FALSE)
It appears that you are reading one million rows at a time, up to about 50 million rows over the whole run. In your comment you mentioned:
"1 million rows would take maybe 100 mbs or so"
At that rate, 50 million rows amount to roughly 5GB of data churned through R's allocator over the course of the loop, which is in line with the memory growth you are observing.
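If the repeated read.csv calls are part of the cost, a sketch of the same chunked 1% sample using data.table::fread (you already load data.table) would typically be faster and lighter; the preallocated list and the chunk bookkeeping here are illustrative, not a drop-in fix for the OS-level issue:

chunk  <- 1000000
starts <- seq(1000000, 47000000, chunk)     # same chunk offsets as the original loop
datalist <- vector("list", length(starts))  # preallocate instead of growing the list
for (k in seq_along(starts)) {
  train <- fread("train.csv", nrows = chunk, skip = starts[k], header = FALSE)
  datalist[[k]] <- train[sample(.N, floor(0.01 * .N))]  # 1% random sample of the chunk
}
newtrain <- rbindlist(datalist)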
As an aside, calling rm() on train, train_ind, and smp_size followed by gc() should remove those variables from your workspace and free their memory. However, this is largely a moot point, because all three are clobbered (overwritten) on the next iteration of the for loop anyway. In other words, explicitly removing them won't really improve your memory usage: each reassignment leaves the previous object unreferenced, and R's garbage collector reclaims it on its own.
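A toy illustration of that point (the sizes here are just for demonstration):

x <- rnorm(10000000)  # allocate roughly 80 MB
gc()                  # the "Vcells" line reflects the large vector
x <- rnorm(10)        # reassignment leaves the old vector unreferenced
gc()                  # the ~80 MB is reclaimed, no rm() needed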
Upvotes: 1