Reputation: 695
I am using fread from the data.table package in R to read a very large data set with 600 million rows and 2 columns. The data is read in fine, but when I try to cross-tabulate the two columns, I get a memory error. The resulting frequency table would have around 4 million rows and 20,000 columns.
Code:
library(data.table)
# read in data
data = fread("C:/Users/User12/Desktop/train.tsv")
# cross-tabulate Y against X
t1 = table(data$Y, data$X)
When I run the above code, I get a memory error. Is there an intelligent way to create a frequency table on this large dataset?
Original Table Example:
data <-
setDT(read.table(text="X Y
4 5
4 6
4 7
5 5
5 6
5 9", header = T))
Output I want:
> data[,table(Y,X)]
X
Y 4 5
5 1 1
6 1 1
7 1 0
9 0 1
Upvotes: 2
Views: 308
Reputation: 34733
The alternative I can suggest is to count the (X, Y) pairs first, then reshape to wide:
dcast(data[ , .N, by = .(X, Y)], X ~ Y, value.var = "N")
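For reference, a quick sketch on the small example from the question: dcast() leaves NA where a combination never occurs, so passing fill = 0L (and flipping the formula to Y ~ X) reproduces the zeros and orientation of table(Y, X):
# sketch on the small example; fill = 0L gives 0 instead of NA for
# absent combinations, and Y ~ X matches the orientation of table(Y, X)
dcast(data[ , .N, by = .(X, Y)], Y ~ X, value.var = "N", fill = 0L)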
But the dcast() approach also fails on a comparably sized sample (warning: generating it is computationally intensive):
set.seed(1203)
data <- data.table(X = sample(4e6, 6e8, T), Y = sample(2e4, 6e8, T))
tt <- dcast(data[ , .N, by = .(X, Y)], X ~ Y, value.var = "N")
Error: cannot allocate vector of size 298.0 Gb
So you may be out of luck doing this within R (perhaps seek an SQL solution?).
I think the following could be a viable alternative:
setkeyv(data, c("Y", "X"))
lapply(data[ , unique(Y)], function(y) data[.(y), table(X)])
Basically, we change the problem into creating a bunch of one-way tables. As long as each of those can allocate properly, we should be able to rbind them into the two-way table you're after.
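A rough sketch of how that assembly might look (untested at this scale; I use .N by X instead of table() so each piece is already a data.table, and freq_list / freq_long / freq_wide are names I've made up):
setkeyv(data, c("Y", "X"))
# one small frequency table per value of Y, tagged with that Y
freq_list <- lapply(data[ , unique(Y)], function(y)
  data[.(y), .N, by = X][ , Y := y])
# stack into long format: one row per observed (X, Y) pair
freq_long <- rbindlist(freq_list)
# widen only if the result actually fits in memory
freq_wide <- dcast(freq_long, Y ~ X, value.var = "N", fill = 0L)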
Barring that, you'll need to consider non-RAM-based alternatives (Spark is a personal favorite, see e.g. SparkR or sparklyr, or perhaps SQL-based methods through sqldf, RPresto, etc.).
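For instance, a rough sqldf-based sketch (path and column names taken from the question; untested on data this size) pushes the counting into a temporary on-disk SQLite database, so only the aggregated result comes back into R:
library(sqldf)
# read.csv.sql loads the file into a temporary SQLite database (referred
# to as "file" in the query) and returns only the grouped counts
counts <- read.csv.sql("C:/Users/User12/Desktop/train.tsv",
                       sql = "select X, Y, count(*) as N from file group by X, Y",
                       sep = "\t")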
Upvotes: 3