Reputation: 14902
I have a sparse dgTMatrix from the Matrix package that has picked up some duplicate colnames. I want to combine these by summing the columns with the same names, forming a reduced Matrix.
I found this post, which I adapted for sparse matrix operations, but it is still very slow on large objects. Is there a better solution, one that operates directly on the indexed elements of the sparse matrix and would be faster? For instance, A@j indexes (from zero) the labels in A@Dimnames[[2]], which could be compacted and used to reindex A@j. (Note: this is why I used the triplet sparse matrix form rather than the Matrix default of column-sparse matrices, since figuring out that p value makes my head hurt every time.)
require(Matrix)
# set up a (triplet) sparseMatrix
# (giveCsparse is deprecated as of Matrix 1.3-0; newer versions use repr = "T")
A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                  giveCsparse = FALSE,
                  dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
A
## 2 x 6 sparse Matrix of class "dgTMatrix"
##    a b c a b c
## r1 1 . 3 . 2 .
## r2 . 2 . 1 . 3
str(A)
## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:6] 0 1 0 1 0 1
## ..@ j : int [1:6] 0 1 2 3 4 5
## ..@ Dim : int [1:2] 2 6
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:2] "r1" "r2"
## .. ..$ : chr [1:6] "a" "b" "c" "a" ...
## ..@ x : num [1:6] 1 2 3 1 2 3
## ..@ factors : list()
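To make the reindexing idea concrete, here is a minimal sketch of how the zero-based @j slot maps onto column labels and how a compacted index could be derived from it. The names M, labels_per_entry and new_j are illustrative only, and the matrix is rebuilt here (via the default constructor plus a coercion to "TsparseMatrix") so the block runs on its own.

```r
library(Matrix)

# same layout as A above: six columns, names a, b, c repeated twice
nms <- rep(letters[1:3], 2)
M <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                  dimnames = list(paste0("r", 1:2), nms))
M <- as(M, "TsparseMatrix")  # triplet form exposes the @i and @j slots

# label of each stored entry: @j is zero-based, R indexing is one-based
labels_per_entry <- colnames(M)[M@j + 1]
labels_per_entry
## [1] "a" "b" "c" "a" "b" "c"

# compacted zero-based column index: position of each label among the unique labels
new_j <- match(labels_per_entry, unique(colnames(M))) - 1L
new_j
## [1] 0 1 2 0 1 2
```

Entries that end up with the same (i, new_j) pair are the ones that would need to be summed.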
# my matrix-based attempt
OP1 <- function(x) {
    nms <- colnames(x)
    if (any(duplicated(nms)))
        x <- x %*% Matrix(sapply(unique(nms), "==", nms))
    x
}
OP1(A)
## 2 x 3 sparse Matrix of class "dgCMatrix"
##    a b c
## r1 1 2 3
## r2 1 2 3
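OP1 builds its indicator matrix densely with sapply() before converting it with Matrix(). As a sketch only, the same indicator could be built directly in sparse form, with one row per original column and a single 1 per row. The helper name combine_by_indicator is hypothetical, and whether this is actually faster on large objects is an assumption that would need benchmarking.

```r
library(Matrix)

# hypothetical helper (not from the original post): same product as OP1,
# but the indicator matrix is never materialised as a dense object
combine_by_indicator <- function(x) {
    nms <- colnames(x)
    unms <- unique(nms)
    # row k has its single 1 in the column of the unique name matching nms[k]
    ind <- sparseMatrix(i = seq_along(nms), j = match(nms, unms), x = 1,
                        dims = c(length(nms), length(unms)),
                        dimnames = list(NULL, unms))
    x %*% ind
}

# small check on the same layout as A
A2 <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                   dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
res <- combine_by_indicator(A2)
```

This still constructs an extra object (the indicator), so it does not meet the "no additional large objects" goal; it only avoids the dense intermediate.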
It works fine, but seems quite slow on the huge sparse objects I intend to use it on. Here's a larger example:
# now something bigger, for testing
set.seed(10)
nr <- 10000 # rows
nc <- 26*100 # columns - 100 repetitions of a-z
nonZeroN <- round(nr * nc / 3) # two-thirds sparse
B <- sparseMatrix(i = sample(1:nr, size = nonZeroN, replace = TRUE),
                  j = sample(1:nc, size = nonZeroN, replace = TRUE),
                  x = round(runif(nonZeroN)*5+1),
                  giveCsparse = FALSE,
                  dimnames = list(paste0("r", 1:nr), rep(letters, nc/26)))
print(B[1:5, 1:10], col.names = TRUE)
## 5 x 10 sparse Matrix of class "dgTMatrix"
##    a b c d e f g h i j
## r1 . . 5 . . 2 . . . .
## r2 . . . . . . . . . 4
## r3 . . . . . . . 3 3 .
## r4 2 2 . 3 . . . 3 . .
## r5 3 . . 1 . . . . . 5
require(microbenchmark)
microbenchmark(OPmatrixCombine1 = OP1(B), times = 30)
## Unit: milliseconds
##              expr      min       lq     mean   median       uq      max neval
##  OPmatrixCombine1 578.9222 619.3912 665.6301 631.4219 646.2716 1013.777    30
Is there a better way, where better means faster and, if possible, not requiring the construction of additional large objects?
Upvotes: 2
Views: 773
Reputation: 14902
Here's an attempt using the index reindexing I had in mind, which I figured out with a friend's help (Patrick, is that you?). It reindexes the j values and uses the very handy feature of sparseMatrix() that adds the x values together for elements whose index positions are the same.
OP2 <- function(x) {
    nms <- colnames(x)
    uniquenms <- unique(nms)
    # build the sparseMatrix again: x's with same index values are automatically
    # added together, keeping in mind that indexes stored from 0 but built from 1
    sparseMatrix(i = x@i + 1,
                 j = match(nms, uniquenms)[x@j + 1],
                 x = x@x,
                 dimnames = list(rownames(x), uniquenms),
                 giveCsparse = FALSE)
}
Results are the same:
OP2(A)
## 2 x 3 sparse Matrix of class "dgTMatrix"
##    a b c
## r1 1 2 3
## r2 1 2 3
all.equal(as(OP1(B), "dgTMatrix"), OP2(B))
## [1] TRUE
But OP2 is faster (OP1 takes about 1.34 times as long at the median):
require(microbenchmark)
microbenchmark(OPmatrixCombine1 = OP1(B),
OPreindexSparse = OP2(B),
times = 30)
## Unit: relative
##              expr      min       lq     mean   median       uq      max neval
##  OPmatrixCombine1 1.756769 1.307651 1.360487 1.341814 1.346864 1.460626    30
##   OPreindexSparse 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    30
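For reference, the summing of entries with repeated index positions that OP2 relies on can be seen in isolation. This is a minimal sketch with made-up values; it uses the default column-compressed representation rather than the triplet form used in OP2.

```r
library(Matrix)

# two of the three entries share position (1, 1); their x values are added
S <- sparseMatrix(i = c(1, 1, 2), j = c(1, 1, 2), x = c(10, 5, 7),
                  dims = c(2, 2))
S[1, 1]
## [1] 15
S[2, 2]
## [1] 7
```

Remapping duplicate column names onto a shared j therefore yields the column sums without an explicit aggregation step.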
Upvotes: 2