Daniele

Reputation: 67

R tabulate large table

I have a huge table (millions of rows and 2 columns) similar to the one below.

  FIELD1 FIELD2
  M01 ABC
  M02 ABC
  MO3 ABC
  M03 CDE
  M04 FGH
  M05 LMN
  M06 FGH
  ... ...

FIELD1 may have millions of unique values and FIELD2 may have up to 10,000 unique values. I am using the following statement in R to read the txt file and tabulate it:

dat <- read.table(file.choose(new = FALSE), sep = "\t")
m <- table(dat)

However, it returns the following error:

Error: cannot allocate vector of size 1.5 Gb
R(390,0xac0442c0) malloc: mmap(size=1599119360) failed (error code=12)
error: can't allocate region
set a breakpoint in malloc_error_break to debug
R(390,0xac0442c0) malloc: mmap(size=1599119360) failed (error code=12)
error: can't allocate region
set a breakpoint in malloc_error_break to debug

Any idea how to overcome this limit? Many thanks.

Upvotes: 2

Views: 957

Answers (2)

G. Grothendieck

Reputation: 269694

This solution uses a single read.csv.sql statement to read the data into an SQLite database (which it creates automatically), perform the calculation in the database (not in R), and read the result into R only after the calculation has been performed. Thus, although it does not produce a table object, it still shows which FIELD1 values are associated with each FIELD2 value, and does so via a more compact representation.

First create some test data:

# create test file
test <- data.frame(FIELD1 = c("M01", "M02", "MO3", "M03", "M04", "M05", "M06"), 
    FIELD2 = c("ABC", "ABC", "ABC", "CDE", "FGH", "LMN", "FGH"))
write.csv(test, file = "test.csv", row.names = FALSE, quote = FALSE)

Then try this:

library(sqldf)
DF <- read.csv.sql("test.csv", 
    sql = "select FIELD2, group_concat(FIELD1) FIELD1 from file group by FIELD2")

With the test data the result is:

>  DF
  FIELD2      FIELD1
1    ABC M01,M02,MO3
2    CDE         M03
3    FGH     M04,M06
4    LMN         M05
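If per-pair counts are wanted rather than the concatenated values, the same approach should work with an aggregate query; here is a sketch (assuming the same test.csv as above), where only the small summary table ever reaches R:

```r
library(sqldf)

# recreate the test file from the question
test <- data.frame(FIELD1 = c("M01", "M02", "MO3", "M03", "M04", "M05", "M06"),
                   FIELD2 = c("ABC", "ABC", "ABC", "CDE", "FGH", "LMN", "FGH"))
write.csv(test, file = "test.csv", row.names = FALSE, quote = FALSE)

# count each FIELD1/FIELD2 pair inside SQLite, not in R
counts <- read.csv.sql("test.csv",
    sql = "select FIELD1, FIELD2, count(*) as Count
           from file group by FIELD1, FIELD2")
```

With the test data every pair occurs once, so counts has 7 rows; on the real file the grouped result stays roughly the size of the number of distinct pairs, not the full cross product.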

Upvotes: 3

Martin Morgan

Reputation: 46866

table will try to create a matrix of length(unique(FIELD1)) by length(unique(FIELD2)), presumably many times the size of your original data; a more efficient representation is as a data frame of counts, along the lines of

lst <- with(test, lapply(split(as.character(FIELD1), FIELD2), table))
df <- data.frame(FIELD1 = unlist(lapply(lst, names), use.names = FALSE),
                 FIELD2 = rep(names(lst), sapply(lst, length)),
                 Count = unlist(lst, use.names = FALSE),
                 stringsAsFactors = TRUE)  # keep factors so as.integer() works below

This could be represented as a sparse matrix

library(Matrix)
m <- with(df, {
    sparseMatrix(as.integer(FIELD1), as.integer(FIELD2), x=Count, 
                 dimnames=list(levels(FIELD1), levels(FIELD2)))
})

leading to

> m
7 x 4 sparse Matrix of class "dgCMatrix"
    ABC CDE FGH LMN
M01   1   .   .   .
M02   1   .   .   .
M03   .   1   .   .
M04   .   .   1   .
M05   .   .   .   1
M06   .   .   1   .
MO3   1   .   .   .

> colSums(m)
[1] 3 1 2 1
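For completeness, base R's xtabs can build the same sparse cross-tabulation in one step via its sparse argument, without ever allocating the dense FIELD1-by-FIELD2 matrix; a minimal sketch using the test data from the first answer:

```r
library(Matrix)  # required for the sparse result class

test <- data.frame(FIELD1 = c("M01", "M02", "MO3", "M03", "M04", "M05", "M06"),
                   FIELD2 = c("ABC", "ABC", "ABC", "CDE", "FGH", "LMN", "FGH"))

# sparse contingency table; only the non-zero cells are stored
m2 <- xtabs(~ FIELD1 + FIELD2, data = test, sparse = TRUE)
```

This gives a 7 x 4 dgCMatrix with the same colSums (3, 1, 2, 1) as above.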

Upvotes: 4
