BioChemoinformatics
BioChemoinformatics

Reputation: 715

R: speed up 'table()' operation for a data.table object

I want like convert a data.table object to the following condition by using table function. That means, colb as colnames, and cola as rownames, and then if the individual col include corresponding row, then put the entry as 1, otherwise 0. I do it like this:

dt <- data.table(cola = c(1, 1, 2, 3), colb = c(10, 20, 30, 40))
dt
table(dt)

> dt
   cola colb
1:    1   10
2:    1   20
3:    2   30
4:    3   40
> table(dt)
colb
cola 10 20 30 40
   1  1  1  0  0
   2  0  0  1  0
   3  0  0  0  1

But when the data set is big, for example 39-million rows by 2 columns in my case, the table operation cost ~80 seconds to finish.

I would like to know is there more efficient method to do the same thing as table function did?

In addition, dcast.data.table(dt, cola ~ colb, fill = 0L) do the same thing when I try it, but the resutls have a little different which should be futher deal with to get the same results as table function. The important thing is that dcast.data.table do not improve the speed when I try my data. So, I hope someone can help to work out more efficient method to do the same thing!

Thank you.

Upvotes: 3

Views: 207

Answers (1)

BioChemoinformatics
BioChemoinformatics

Reputation: 715

First, Thanks @Arun and all. Yes, sparseMatrix can solve my original qestions. Here I list the answer (according to Arun's suggestion) . Here is just an demo example I originally hope:

dt <- data.table(sid = c(1, 2, 3, 4, 3, 2, 1, 6, 1, 2), 
                 aid = c(100, 100, 100, 100, 200, 200, 200, 300, 300, 300))
dt
library(Matrix)
sm <- sparseMatrix(dt[, sid], dt[, aid], x = TRUE)   
cp <- t(sm) %*% sm   
cp <- summary(cp)  
cp <- cp[cp$i < cp$j, ]     
as.data.frame(cp)
    i   j x
4 100 200 3
7 100 300 2
8 200 300 2

This method is more efficent than I used before ('Ananda Mahto' method). The time comparison for my data set (39763098 rows and 2 columns): ~141 seconds VS ~40 seconds for my original method and Arun's method respectively. So Thanks, it is perfect. Second, Hope that for my current post's question, data.table can give a improvement too. You will.

Upvotes: 2

Related Questions