zorny

Reputation: 695

Efficient transformation of data.table in R

I am using the fread function from the data.table package in R to read a very large data set with 600 million rows and 2 columns. The data reads in fine, but when I try to build a frequency table (cross-tabulation) of the two columns, I get a memory error. The resulting table would have around 4 million rows and 20,000 columns.

Code:

library(data.table)
#read in data
data = fread("C:/Users/User12/Desktop/train.tsv")
t1 = table(data$Y,data$X)

When I run the above code, I get a memory error. Is there an intelligent way to create a frequency table on this large dataset?

Original Table Example:

data <- 
  setDT(read.table(text="X    Y
4    5
4    6
4    7
5    5
5    6
5    9", header = T))

Output I want:

> data[,table(Y,X)]
   X
Y   4 5
  5 1 1
  6 1 1
  7 1 0
  9 0 1

Upvotes: 2

Views: 308

Answers (1)

MichaelChirico

Reputation: 34733

The alternative I can suggest is:

dcast(data[ , .N, by = .(X, Y)], X ~ Y, value.var = "N")

But this also fails on a sample data set of comparable size (warning: the example below is computationally intensive):

set.seed(1203)
data <- data.table(X = sample(4e6, 6e8, T), Y = sample(2e4, 6e8, T))
tt <- dcast(data[ , .N, by = .(X, Y)], X ~ Y, value.var = "N")

Error: cannot allocate vector of size 298.0 Gb

So you may be out of luck doing this within R (perhaps seek an SQL solution?).

Update: keyed approach

I think the following could be a viable alternative:

setkeyv(data, c("Y", "X"))

lapply(data[ , unique(Y)], function(y) data[.(y), table(X)])

Basically, we change the problem into creating a bunch of one-way tables, one per value of Y. As long as each of those can allocate properly, we should be able to stitch the results back together into the two-way table you're after; a sketch of that stitching step follows below.
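To make that concrete, here is a minimal, untested-at-this-scale sketch of combining the one-way tables into a long (Y, X, N) count table and then, if memory permits, the dense two-way layout. The names ys, one_way, long, and wide are just illustrative, and the final dcast to the full 4 million by 20,000 grid may of course still be the step that runs out of memory:

setkeyv(data, c("Y", "X"))
ys <- data[ , unique(Y)]

# one frequency table of X per value of Y
one_way <- lapply(ys, function(y) data[.(y), table(X)])

# stack into a long (Y, X, N) count table -- a sparse representation of the two-way table
long <- rbindlist(lapply(seq_along(ys), function(i)
  data.table(Y = ys[i], X = names(one_way[[i]]), N = as.integer(one_way[[i]]))))

# the dense two-way layout, if it fits; fill = 0L supplies the zero cells
wide <- dcast(long, Y ~ X, value.var = "N", fill = 0L)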

Barring that, you'll need to consider non-RAM-based alternatives (Spark is a personal favorite, see e.g. SparkR or sparklyr, or perhaps SQL-based methods through sqldf, RPresto, etc).
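For the sqldf route, a hedged sketch might look like the following. The file path and tab separator are taken from the question, and I'm assuming the file has a header row with columns X and Y; read.csv.sql() with an on-disk dbname loads the file into a temporary SQLite database and does the grouping there rather than in R's memory:

library(sqldf)

# aggregate inside SQLite; the result is the long (X, Y, N) count table
counts <- read.csv.sql(
  "C:/Users/User12/Desktop/train.tsv",
  sql = "select X, Y, count(*) as N from file group by X, Y",
  sep = "\t",
  dbname = tempfile()
)

From there the dense cross-tabulation, if you really need it, could be assembled in chunks rather than all at once.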

Upvotes: 3
