Reputation: 67
I have an huge table (millions of rows and 2 columns) similar to the one below.
FIELD1 may have millions of unique values and FIELD2 may have up to 10,000 unique value. I am using the following statement in R to read the txt file and to tabulate it:
dat<-read.table(file.choose(new = FALSE), sep = "\t")
m=table(dat)
However it returns the following error
Error: cannot allocate vector of size 1.5 Gb R(390,0xac0442c0) malloc: mmap(size=1599119360) failed (error code=12) error: can't allocate region set a breakpoint in malloc_error_break to debug R(390,0xac0442c0) malloc: mmap(size=1599119360) failed (error code=12) error: can't allocate region * set a breakpoint in malloc_error_break to debug*
Any idea how to overcome this limit? Many thanks.
Upvotes: 2
Views: 957
Reputation: 269694
This solution uses a single read.csv.sql
statement to read the data into an SQLite database (which it automatically creates), perform the calculation in the database (not in R) and then only read it into R after the calculation has ben performed. Thus although it does not do a table
it still shows which FIELD1 values are associated with each FIELD2 value and does so via a more compact representation.
First create some test data:
# create test file
test <- data.frame(FIELD1 = c("M01", "M02", "MO3", "M03", "M04", "M05", "M06"),
FIELD2 = c("ABC", "ABC", "ABC", "CDE", "FGH", "LMN", "FGH"))
write.csv(test, file = "test.csv", row.names = FALSE, quote = FALSE)
Then try this:
library(sqldf)
DF <- read.csv.sql("test.csv",
sql = "select FIELD2, group_concat(FIELD1) FIELD1 from file group by FIELD2")
With the test data the result is:
> DF
FIELD2 FIELD1
1 ABC M01,M02,MO3
2 CDE M03
3 FGH M04,M06
4 LMN M05
Upvotes: 3
Reputation: 46866
table
will try to create a matrix of length(unique(FIELD1))
by length(unique(FIELD2))
, presumably many times the size of your original data; a more efficient representation is as a data frame of counts, along the lines of
lst <- with(test, lapply(split(as.character(FIELD1), FIELD2), table))
df <- data.frame(FIELD1 = unlist(lapply(lst, names), use.names=FALSE),
FIELD2 = rep(names(lst), sapply(lst, length)),
Count = unlist(lst, use.names=FALSE))
This could be represented as a sparse matrix
library(Matrix)
m <- with(df, {
sparseMatrix(as.integer(FIELD1), as.integer(FIELD2), x=Count,
dimnames=list(levels(FIELD1), levels(FIELD2)))
})
leading to
> m
7 x 4 sparse Matrix of class "dgCMatrix"
ABC CDE FGH LMN
M01 1 . . .
M02 1 . . .
M03 . 1 . .
M04 . . 1 .
M05 . . . 1
M06 . . 1 .
MO3 1 . . .
> colSums(m)
[1] 3 1 2 1
Upvotes: 4