mlinegar

Reputation: 1399

R's data.table crashing upon subset of factors for large data

I have a relatively large data.table (around 1 billion rows and 30 columns) and am trying to subset it to remove some categories I'm not interested in. The category variable is a factor with around 30 levels. However, whenever I do this, my session is killed. Is there a way to subset a data.table in place?

With my data.table named dt, the line that causes the crash is:

dt <- dt[!category %in% c('f', 'g')]

Any suggestions for how to avoid this issue? Apologies for the lack of a reproducible example, it's obviously difficult with this scale of data. I'm using R version 3.6.1 and data.table version 1.12.9.

Upvotes: 0

Views: 294

Answers (1)

Ian Campbell

Reputation: 24888

I tried some approaches with 500 million rows and 5 columns.

I got about an 8% reduction in memory allocation with a couple of optimizations:

Edit: You can get another 3-4% with @Henrik's suggestion.

library(data.table)
library(bench)
set.seed(3)
# sample.size <- 500000000  # Don't try this on your home laptop, folks
sample.size <- 1000000
# One factor column with 26 levels plus five integer columns
test.dt <- data.table(category = sample(as.factor(letters), size = sample.size, replace = TRUE),
                      as.data.table(lapply(1:5, function(x) as.integer(runif(sample.size, 1, 100)))))

# Compare three equivalent ways of dropping categories 'f' and 'g'
mark(result <- test.dt[!category %in% c('f', 'g')],            # original %in% filter
     result <- test.dt[!(category == 'f' | category == 'g')],  # explicit comparisons
     result <- test.dt[!c('f', 'g'), on = "category"])         # anti-join on category


 expression                                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time 
  <bch:expr>                                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 result <- test.dt[!category %in% c("f", "g")]             18.15s   18.15s    0.0551    23.2GB    0.110     1     2     18.15s
2 result <- test.dt[!(category == "f" | category == "g")]    8.43s    8.43s    0.119     21.3GB    0.119     1     1      8.43s
3 result <- test.dt[!c("f", "g"), on = "category"]           7.83s    7.83s    0.128     20.6GB    0.383     1     3      7.83s
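
None of the three expressions above subsets in place; each one builds a full copy of the kept rows while the original is still in memory. As far as I know, data.table has no built-in way to delete rows by reference, so a common workaround is to rebuild the table one column at a time and drop each original column as soon as it has been copied. Below is a minimal sketch of that idea. The helper name subset_by_column is hypothetical (it is not part of data.table), the small dt is made up for illustration, and I have not benchmarked this at the 500 million row scale, so treat any memory savings as an assumption to verify.

```r
library(data.table)

# Hypothetical helper, not part of data.table: keep only the rows flagged in
# `keep`, copying one column at a time and deleting each source column from DT
# by reference once it has been copied.
subset_by_column <- function(DT, keep) {
  cols <- copy(names(DT))  # copy, because the names of DT change below

  # Start the result with the first column only
  out <- data.table(DT[[cols[1]]][keep])
  setnames(out, cols[1])

  for (col in cols[-1]) {
    set(out, j = col, value = DT[[col]][keep])  # add the subsetted column
    set(DT, j = col, value = NULL)              # drop the original column
  }
  out
}

# Small made-up example
dt <- data.table(category = factor(c("a", "f", "b", "g", "c", "a")),
                 x = 1:6,
                 y = (1:6) * 10)

# Logical mask of rows to keep; this vector is itself one value per row
keep <- !(dt$category %in% c("f", "g"))

# The old table is left with only its first column and is released once
# dt is reassigned
dt <- subset_by_column(dt, keep)
```

Note that the function modifies its input by reference, so the original table is unusable afterwards. The factor levels 'f' and 'g' remain in the result, as with the expressions above; droplevels() would remove them if needed.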

Upvotes: 3
