Reputation: 2652
I was working on creating some adjacency matrices and stumbled on a weird issue.
I have one matrix full of 1s and 0s. I want to multiply the transpose of it by it (t(X) %*% X) and then run some other stuff. Since the routine started to get really slow, I converted it to a sparse matrix, which obviously went faster.
However, the sparse matrix gets double the size depending on when I convert the matrix to a sparse format.
Here is a generic example that runs into the same issue:
library(Matrix)
set.seed(666)
nr = 10000
nc = 1000
bb = matrix(rnorm(nc *nr), ncol = nc, nrow = nr)
bb = apply(bb, 2, function(x) x = as.numeric(x > 0))
# Slow and unintelligent method
op1 = t(bb) %*% bb
op1 = Matrix(op1, sparse = TRUE)
# Fast method
B = Matrix(bb, sparse = TRUE)
op2 = t(B) %*% B
# weird
identical(op1, op2) # returns FALSE
object.size(op2)
#12005424 bytes
object.size(op1) # almost half the size
#6011632 bytes
# now it works...
ott1 = as.matrix(op1)
ott2 = as.matrix(op2)
identical(ott1, ott2) # returns TRUE
Then I got curious. Does anybody know why this happens?
Upvotes: 3
Views: 190
Reputation: 31452
The class of op1 is dsCMatrix, whereas op2 is a dgCMatrix. dsCMatrix is a class for symmetric matrices, which therefore only needs to store the upper half plus the diagonal (roughly half as much data as the full matrix).
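To make the storage difference concrete, here is a minimal sketch on a tiny symmetric matrix. The matrix m and the name S are made up for illustration; the point is only that the symmetric sparse class keeps one triangle in its x slot.

```r
library(Matrix)

# A small, fully populated symmetric matrix (illustrative values only)
m <- matrix(c(2, 1, 3,
              1, 5, 4,
              3, 4, 6), nrow = 3, byrow = TRUE)

# Matrix() detects the symmetry and picks a symmetric sparse class
S <- Matrix(m, sparse = TRUE)
class(S)

# Entries actually stored (one triangle plus the diagonal)
# versus nonzero entries in the full matrix
length(S@x)
sum(m != 0)
```

For a 3 x 3 matrix with no zeros, that is 6 stored values against 9 logical entries, which is where the "roughly half" comes from for larger matrices.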
The Matrix statement that converts a dense to a sparse matrix is smart enough to choose a symmetric class for symmetric matrices, hence the saving. You can see this in the code for the function Matrix, which explicitly performs the test isSym <- isSymmetric(data).
%*% on the other hand is optimised for speed and does not perform this check.
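If you want the fast sparse product and the compact symmetric storage, one option is to convert the result afterwards. The sketch below uses a scaled-down version of the example from the question and assumes forceSymmetric() from the Matrix package is an acceptable way to do that conversion; the name op2_sym is made up here. Classes are printed rather than assumed, since the exact class returned by a product can differ between Matrix versions.

```r
library(Matrix)

set.seed(666)
nr <- 200
nc <- 20
# Dense 0/1 matrix, analogous to bb in the question but much smaller
bb <- matrix(as.numeric(rnorm(nc * nr) > 0), ncol = nc, nrow = nr)

# Dense product, then conversion: Matrix() tests for symmetry
op1 <- Matrix(t(bb) %*% bb, sparse = TRUE)

# Sparse product: %*% does not run the symmetry test
B <- Matrix(bb, sparse = TRUE)
op2 <- t(B) %*% B

class(op1)
class(op2)

# t(B) %*% B is symmetric by construction, so it is safe to
# keep only one triangle (assumption: forceSymmetric() suits your use)
op2_sym <- forceSymmetric(op2)
class(op2_sym)

# Compare sizes and check the values still agree
object.size(op2)
object.size(op2_sym)
all.equal(as.matrix(op1), as.matrix(op2_sym))
```

crossprod(B) computes the same product as t(B) %*% B and may be worth trying as well; check class() on its result in your installed version rather than relying on it returning a symmetric class.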
Upvotes: 4