Reputation: 1399
I have a relatively large data.table (around 1 billion rows and 30 columns) and am trying to subset it to remove some categories I'm not interested in. The category
variable is a factor with around 30 levels. However, when I do this my session is consistently killed. Is there a way to subset a data.table in place?
Given that my data.table is dt, the line in question, which causes the crash, is:
dt <- dt[!category %in% c('f', 'g')]
Any suggestions for how to avoid this issue? Apologies for the lack of a reproducible example, it's obviously difficult with this scale of data. I'm using R version 3.6.1 and data.table version 1.12.9.
Upvotes: 0
Views: 294
Reputation: 24888
I tried some approaches with 500 million rows and 5 columns.
I got about an 8% improvement in memory allocation with a couple of optimizations.
Edit: You can get another 3-4% with @Henrik's suggestion.
library(data.table)
library(bench)

set.seed(3)
#sample.size <- 500000000 # Don't try this on your home laptop, folks
sample.size <- 1000000
test.dt <- data.table(category = sample(as.factor(letters), size = sample.size, replace = TRUE),
                      as.data.table(lapply(1:5, function(x) as.integer(runif(sample.size, 1, 100)))))

mark(result <- test.dt[!category %in% c('f', 'g')],
     result <- test.dt[!(category == 'f' | category == 'g')],
     result <- test.dt[!c('f', 'g'), on = "category"])
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 result <- test.dt[!category %in% c("f", "g")] 18.15s 18.15s 0.0551 23.2GB 0.110 1 2 18.15s
2 result <- test.dt[!(category == "f" | category == "g")] 8.43s 8.43s 0.119 21.3GB 0.119 1 1 8.43s
3 result <- test.dt[!c("f", "g"), on = "category"] 7.83s 7.83s 0.128 20.6GB 0.383 1 3 7.83s
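If even the fastest of these still exhausts memory, note that data.table cannot delete rows by reference, so every row filter allocates a new table. One known workaround to reduce peak memory is to compute the keep-index once, then rebuild the result column by column, dropping each old column as soon as it has been copied. This is a hedged sketch (column names and sizes are illustrative, not from the question):

```r
library(data.table)

set.seed(3)
dt <- data.table(category = factor(sample(letters, 1e5, replace = TRUE)),
                 v1 = runif(1e5),
                 v2 = runif(1e5))

# Compute the surviving row indices once, as integers.
keep <- which(!dt$category %in% c("f", "g"))

# Rebuild column by column; NULL-ing each source column as we go means
# only roughly one extra column's worth of data is alive at a time,
# instead of a full second copy of the whole table.
cols <- names(dt)
result <- data.table(dt[[cols[1]]][keep])
setnames(result, cols[1])
for (col in cols[-1]) {
  result[, (col) := dt[[col]][keep]]
  dt[, (col) := NULL]  # free the old column immediately
}

stopifnot(!any(result$category %in% c("f", "g")))
```

The trade-off is more code and a destroyed source table, but peak allocation is bounded by the index vector plus a single column rather than a second full copy.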
Upvotes: 3