Reputation: 715
I want like convert a data.table object to the following condition by using table
function. That means, colb
as colnames, and cola
as rownames, and then if the individual col
include corresponding row
, then put the entry as 1
, otherwise 0
. I do it like this:
dt <- data.table(cola = c(1, 1, 2, 3), colb = c(10, 20, 30, 40))
dt
table(dt)
> dt
cola colb
1: 1 10
2: 1 20
3: 2 30
4: 3 40
> table(dt)
colb
cola 10 20 30 40
1 1 1 0 0
2 0 0 1 0
3 0 0 0 1
But when the data set is big, for example 39-million rows by 2 columns in my case, the table operation cost ~80 seconds to finish.
I would like to know is there more efficient method to do the same thing as table function did?
In addition, dcast.data.table(dt, cola ~ colb, fill = 0L)
do the same thing when I try it, but the resutls have a little different which should be futher deal with to get the same results as table function. The important thing is that dcast.data.table
do not improve the speed when I try my data. So, I hope someone can help to work out more efficient method to do the same thing!
Thank you.
Upvotes: 3
Views: 207
Reputation: 715
First, Thanks @Arun and all. Yes, sparseMatrix
can solve my original qestions. Here I list the answer (according to Arun's suggestion) . Here is just an demo example I originally hope:
dt <- data.table(sid = c(1, 2, 3, 4, 3, 2, 1, 6, 1, 2),
aid = c(100, 100, 100, 100, 200, 200, 200, 300, 300, 300))
dt
library(Matrix)
sm <- sparseMatrix(dt[, sid], dt[, aid], x = TRUE)
cp <- t(sm) %*% sm
cp <- summary(cp)
cp <- cp[cp$i < cp$j, ]
as.data.frame(cp)
i j x
4 100 200 3
7 100 300 2
8 200 300 2
This method is more efficent than I used before ('Ananda Mahto' method). The time comparison for my data set (39763098 rows and 2 columns): ~141 seconds VS ~40 seconds for my original method and Arun's method respectively. So Thanks, it is perfect. Second, Hope that for my current post's question, data.table can give a improvement too. You will.
Upvotes: 2