Nikos

Reputation: 3297

Removing lines in data.table and spiking memory usage

I have a data.table of a decent size: 89M rows, 3.7Gb. Keys are in place, so everything is set up properly. However, I am experiencing a problem when I remove rows based on a column's value: the memory usage just goes through the roof!

Just for the record, I have read the other posts here about this, but they don't really help much. Also, I am using RStudio, which I am pretty sure is not ideal but helps while experimenting; I notice the same behaviour in the plain R console, though. I am on Windows.

Let me post an example (taken from a similar question regarding removal of rows) that creates a very big data.table, approx 1e6 x 100:

library(data.table)                 #needed for data.table() and melt()
rm(list = ls(all = TRUE))           #Clean stuff
gc(reset = TRUE)                    #Call gc (not really helping but whatever..)
dimension = 1e6                     #let's say a million
DT = data.table(col1 = 1:dimension)
cols = paste0('col', 2:100)         #let these be conditions as columns
for (col in cols) { DT[, (col) := 1:dimension] }
DT.m <- melt(DT, id.vars = c('col1', 'col2', 'col3'))

Ok so now we have a data.table with 97M rows, approx 1.8Gb. This is our starting point.
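(You can confirm the size directly; the exact figures may differ a little depending on your setup:)

dim(DT.m)                                 # 97,000,000 rows x 5 columns
print(object.size(DT.m), units = "Gb")    # roughly 1.8 Gb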

Let's remove all rows where the value column (after the melt) is e.g. 4

DT.m <- DT.m[value != 4]

That last line takes a huge amount of memory! Prior to executing it, the memory usage on my PC is approx 4.3Gb, and just after it is executed, it jumps to 6.9Gb!

This is the correct way to remove the lines, right? (just checking). Has anyone come across this behaviour before?

I thought of looping over all parameters and keeping only the rows I am interested in, in another data.table (a rough sketch below), but somehow I doubt that this is a proper way of working.
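Something like this is what I had in mind (looping over the variable column that melt creates and binding the kept rows back together), though it doesn't feel right:

groups <- levels(DT.m$variable)
keep   <- lapply(groups, function(v) DT.m[variable == v & value != 4])
DT.m   <- rbindlist(keep)   # same result as the one-liner above, just built group by group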

I am looking forward to your help.

Thanks Nikos

Upvotes: 0

Views: 362

Answers (1)

Arun

Reputation: 118839

Update: With this commit, the logical vector is replaced by row indices to save memory (read the rest of this post for more info). Fixed in 1.9.5.


Doing sum(DT.m$value == 4L) gives me 97. That is, you're removing a total of 97 rows out of 97 million. This in turn implies that the subset operation will return a ~1.8GB data set as well.

  • Your memory usage was ~4.3GB to begin with.
  • The condition you provide, value != 4, results in a logical vector of length 97 million, which takes ~360MB.
  • data.table computes which() on that logical vector to fetch the row indices; since almost all rows match, that's another ~360MB of integers.
  • The subset itself has to be allocated as a new object first, and that's another ~1.8GB.

Total comes to 4.3 + 1.8 + 0.72 ≈ 6.8GB.
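A quick, approximate way to see where those two ~360MB chunks come from (logicals and integers are both 4 bytes per element in R):

n <- nrow(DT.m)                               # ~97 million
print(object.size(logical(n)), units = "Mb")  # size of the logical vector from value != 4
print(object.size(integer(n)), units = "Mb")  # size of the integer indices from which()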

And garbage collection hasn't happened yet. If you now do gc(), the memory corresponding to the old DT.m should be released.

The only place I can see where we can save space is by using the integer index vector in place of the logical vector (rather than storing the integer indices in another vector on top of it), which would save the extra ~360MB.

Usually which() results in a much smaller (often negligible) index vector, and the subset is therefore faster; that is the reason for using which() in the first place. But in this case you remove only 97 rows, so the index vector is almost as long as the data itself.
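To make that concrete, compare the index vector you get when only a few rows match against the one you get here (sizes are approximate):

idx_small <- which(DT.m$value == 4L)          # 97 integers - negligible
idx_large <- which(DT.m$value != 4L)          # ~97 million integers - ~370 Mb
print(object.size(idx_small), units = "Kb")
print(object.size(idx_large), units = "Mb")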

But it's good to know that we can save a bit of memory. Could you please file an issue here?

Removing rows by reference (#635), when implemented, should be both fast and memory efficient.

Upvotes: 3
