Reputation: 181
So the idea is to loop through a dataset of 50 million rows, reading one million observations per iteration, taking a 1% random sample of each million-row chunk, and storing that sample in a list. In the end this should yield around 500k rows, i.e. 1% of 50 million. Unfortunately, R's memory consumption keeps growing as the iterations progress. Am I using the rm() and gc() functions incorrectly to remove the objects? I can't tell where the memory is going.
library(data.table)
set.seed(123)
iter <- seq(1000000, 47000000, 1000000)
datalist <- vector("list", length(iter))  # the result list must exist before the loop
j <- 1
for (i in iter) {
  # read the next chunk of one million rows
  train <- read.csv(file = 'train.csv', nrows = 1000000, skip = i, header = FALSE)
  # draw a 1% random sample of the chunk
  smp_size <- floor(.01 * nrow(train))
  train_ind <- sample(seq_len(nrow(train)), size = smp_size)
  train <- train[train_ind, ]
  datalist[[j]] <- train
  j <- j + 1
  # try to release the chunk before the next iteration
  rm(train, train_ind, smp_size)
  gc()
}
newtrain <- rbindlist(datalist)
Upvotes: 2
Views: 1200
Reputation: 181
I have solved the problem. The code itself is fine; the issue was that the OS was not releasing the memory R frees, whether through gc() or when objects go out of scope. This has been discussed on this site, but only infrequently and vaguely, and in general it is hard to find much information about this issue on the internet.
Switching from Windows XP to Windows 7 and then finally to Linux solved the problem. On Linux, malloc_trim() can additionally be used to ask the allocator to return freed memory to the OS (mallinfo() only reports allocator statistics; it does not release anything).
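For anyone who wants to try this, here is a minimal sketch of exposing malloc_trim() to R via a tiny C shim. It assumes a Linux/glibc system with R's build tools installed; the file name trim.c and the wrapper name r_malloc_trim are illustrative, not an established API:

# sketch: compile a one-function C shim around glibc's malloc_trim()
writeLines(c(
  "#include <malloc.h>",
  "#include <Rinternals.h>",
  "SEXP r_malloc_trim(void) { malloc_trim(0); return R_NilValue; }"
), "trim.c")
system("R CMD SHLIB trim.c")       # compiles trim.c to trim.so
dyn.load("trim.so")
invisible(.Call("r_malloc_trim"))  # ask glibc to return freed pages to the OS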
Upvotes: 0
Reputation: 522762
The memory growth you are seeing as your script's for loop progresses is most likely due to the overhead of your list datalist combined with the sheer volume of data passing through the loop, roughly 50 million rows in total.
Here is your call to read.csv:
train <- read.csv(file='train.csv', nrows = 1000000, skip=i, header=FALSE)
It appears that you are reading one million rows at a time, up to about 50 million rows over the whole run. In your comment you mentioned:
"1 million rows would take maybe 100 mbs or so"
At that rate, 50 million rows amount to roughly 5GB of data churned through R's allocator over the course of the loop, which is in line with the memory growth you are observing.
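If the repeated read.csv calls are part of the cost, a sketch of the same chunked 1% sample using data.table::fread (you already load data.table) would typically be faster and lighter; the preallocated list and the chunk bookkeeping here are illustrative, not a drop-in fix for the OS-level issue:

chunk  <- 1000000
starts <- seq(1000000, 47000000, chunk)     # same chunk offsets as the original loop
datalist <- vector("list", length(starts))  # preallocate instead of growing the list
for (k in seq_along(starts)) {
  train <- fread("train.csv", nrows = chunk, skip = starts[k], header = FALSE)
  datalist[[k]] <- train[sample(.N, floor(0.01 * .N))]  # 1% random sample of the chunk
}
newtrain <- rbindlist(datalist)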
As an aside, calling rm() on train, train_ind, and smp_size followed by gc() should remove those variables from your workspace and free their memory. However, this is largely a moot point, because all three are clobbered (overwritten) on the next iteration of the for loop anyway. In other words, explicitly removing them won't really improve your memory usage: each reassignment leaves the previous object unreferenced, and R's garbage collector reclaims it on its own.
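A toy illustration of that point (the sizes here are just for demonstration):

x <- rnorm(10000000)  # allocate roughly 80 MB
gc()                  # the "Vcells" line reflects the large vector
x <- rnorm(10)        # reassignment leaves the old vector unreferenced
gc()                  # the ~80 MB is reclaimed, no rm() needed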
Upvotes: 1