Reputation: 45
I am attempting to use the Matrix package to bind two sparse matrices of different size together. The binding is on rows, using the column names for matching.
Table A:
ID | AAAA | BBBB |
------ | ------ | ------ |
XXXX | 1 | 2 |
Table B:
ID | BBBB | CCCC |
------ | ------ | ------ |
YYYY | 3 | 4 |
Binding table A and B:
ID | AAAA | BBBB | CCCC |
------ | ------ | ------ | ------ |
XXXX | 1 | 2 | |
YYYY | | 3 | 4 |
The intention is to insert a large number of small matrices into a single large matrix, to enable continuous querying and update/inserts.
I find that neither the Matrix nor the slam package has functionality to handle this.
Similar questions have been asked in the past, but it seems no solution has been found:
Post 1: in-r-when-using-named-rows-can-a-sparse-matrix-column-be-added-concatenated
Post 2: bind-together-sparse-model-matrices-by-row-names
Ideas on how to solve it will be highly appreciated.
Best regards,
Frederik
Upvotes: 4
Views: 3925
Reputation: 647
Starting from Valentin's answer above, I made my own merge.sparse function, to achieve the following:
The code below seems to do that:
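if (length(find.package(package = "Matrix", quiet = TRUE)) == 0) install.packages("Matrix")
library(Matrix)
merge.sparse <- function(...) {
  cnnew <- character()  # union of the column names seen so far
  rnnew <- character()  # union of the row names seen so far
  x <- vector()
  i <- numeric()
  j <- numeric()
  for (M in list(...)) {
    cnold <- colnames(M)
    rnold <- rownames(M)
    cnnew <- union(cnnew, cnold)
    rnnew <- union(rnnew, rnold)
    # positions of this matrix's names within the merged names
    cindnew <- match(cnold, cnnew)
    rindnew <- match(rnold, rnnew)
    # local (row, col) of the non-zero entries, in column-major order;
    # this lines up with M@x as long as M stores no explicit zeros
    ind <- unname(which(M != 0, arr.ind = TRUE))
    i <- c(i, rindnew[ind[, 1]])
    j <- c(j, cindnew[ind[, 2]])
    x <- c(x, M@x)
  }
  sparseMatrix(i = i, j = j, x = x, dims = c(length(rnnew), length(cnnew)), dimnames = list(rnnew, cnnew))
}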
I tested it with the following data:
df1 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("N","N","M","X","X","Z","Z"))
M1 <- xtabs(~y+x,df1,sparse=T)
df2 <- data.frame(x=c("S","S","T","T","U","V","V","W","W","X"),y=c("N","M","M","K","Z","M","N","N","K","Z"))
M2 <- xtabs(~y+x,df2,sparse=T)
df3 <- data.frame(x=c("A","C","C","B"),y=c("N","M","Z","K"))
M3 <- xtabs(~y+x,df3,sparse=T)
df4 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("F","F","G","G","H","I","L"))
M4 <- xtabs(~y+x,df4,sparse=T)
df5 <- data.frame(x=c("K1","K2","K3","K4"),y=c("J1","J2","J3","J4"))
M5 <- xtabs(~y+x,df5,sparse=T)
Which gave:
Ms <- merge.sparse(M1,M2,M3,M4,M5)
as.matrix(Ms)
#    N R S T U V W X A B C K1 K2 K3 K4
#M   0 1 1 1 0 1 0 0 0 0 1  0  0  0  0
#N   1 1 1 0 0 1 1 0 1 0 0  0  0  0  0
#X   0 0 1 1 0 0 0 0 0 0 0  0  0  0  0
#Z   0 0 0 1 2 0 0 1 0 0 1  0  0  0  0
#K   0 0 0 1 0 0 1 0 0 1 0  0  0  0  0
#F   1 1 0 0 0 0 0 0 0 0 0  0  0  0  0
#G   0 1 1 0 0 0 0 0 0 0 0  0  0  0  0
#H   0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#I   0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#L   0 0 0 0 1 0 0 0 0 0 0  0  0  0  0
#J1  0 0 0 0 0 0 0 0 0 0 0  1  0  0  0
#J2  0 0 0 0 0 0 0 0 0 0 0  0  1  0  0
#J3  0 0 0 0 0 0 0 0 0 0 0  0  0  1  0
#J4  0 0 0 0 0 0 0 0 0 0 0  0  0  0  1
Ms
#14 x 15 sparse Matrix of class "dgCMatrix"
# [[ suppressing 15 column names ‘N’, ‘R’, ‘S’ ... ]]
#
#M  . 1 1 1 . 1 . . . . 1 . . . .
#N  1 1 1 . . 1 1 . 1 . . . . . .
#X  . . 1 1 . . . . . . . . . . .
#Z  . . . 1 2 . . 1 . . 1 . . . .
#K  . . . 1 . . 1 . . 1 . . . . .
#F  1 1 . . . . . . . . . . . . .
#G  . 1 1 . . . . . . . . . . . .
#H  . . . 1 . . . . . . . . . . .
#I  . . . 1 . . . . . . . . . . .
#L  . . . . 1 . . . . . . . . . .
#J1 . . . . . . . . . . . 1 . . .
#J2 . . . . . . . . . . . . 1 . .
#J3 . . . . . . . . . . . . . 1 .
#J4 . . . . . . . . . . . . . . 1
I don't know why the column names are 'suppressed' when displaying the merged sparse matrix Ms; converting to a non-sparse matrix does bring them back, so...
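On the suppressed column names: as far as I can tell from the Matrix documentation for printSpMatrix, the print method for sparse matrices drops column names by default once there are ten or more columns, and accepts a col.names argument to override that. A minimal sketch, using a made-up 2 x 12 matrix (the names r1, r2 and K1..K12 are only for illustration):

```r
library(Matrix)

# Illustrative matrix with 12 columns, enough to trigger the default suppression
M <- sparseMatrix(i = c(1, 2), j = c(1, 12), x = c(1, 2), dims = c(2, 12),
                  dimnames = list(c("r1", "r2"), paste0("K", 1:12)))

print(M)                    # column names are suppressed by default here
print(M, col.names = TRUE)  # ask the print method to show them
```

This only changes how the matrix is displayed; the dimnames are stored in the object either way, which is why as.matrix() shows them.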
Also, I noticed that when the same 'coordinates' are included multiple times, the sparse matrix contains the sum of the corresponding values in x (see row "Z", column "U", which is 1 in both M1 and M2). Maybe there is a way to change that, but for my applications this is fine.
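The summing of repeated coordinates is the documented default of sparseMatrix(). Its use.last.ij argument switches to keeping only the last value supplied for each (i, j) pair. A small sketch with made-up values:

```r
library(Matrix)

# The same coordinate (row 1, column 2) is supplied twice
i <- c(1, 1)
j <- c(2, 2)
x <- c(1, 5)

summed <- sparseMatrix(i = i, j = j, x = x, dims = c(2, 2))
last   <- sparseMatrix(i = i, j = j, x = x, dims = c(2, 2), use.last.ij = TRUE)

summed[1, 2]  # 6: duplicates are added up (default)
last[1, 2]    # 5: only the last value for the pair is kept
```

In a merge function like the one above, this would make later matrices overwrite earlier ones where they overlap, instead of accumulating.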
I thought I'd share this code in case anyone else needs to merge sparse matrices this way, and in case someone can test it on large matrices and suggest performance improvements.
After checking this post I found that extracting the information about the (non-zero) elements of the sparse matrix can be done much more easily with summary, without using which.
So this part of my code above:
ind <- unname(which(M != 0,arr.ind=T))
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,M@x)
can be replaced by:
ind <- summary(M)
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,ind[,3])
Now I don't know which of these is computationally more efficient, or if there is an even easier way to do this by changing the dimensions of the matrices and then just summing them, but this seems to work for me, so...
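For reference, a small sketch of what summary() returns for a sparse matrix: a data frame of triplets with columns i, j and x. Because the indices and the values come from the same object, they stay paired row by row, whereas the which() version relies on which(M != 0) and M@x happening to line up. The matrix below is made up for illustration:

```r
library(Matrix)

M <- sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 3), x = c(4, 5, 6), dims = c(2, 3))

s <- summary(M)  # one row per stored entry
names(s)         # "i" "j" "x"

# The triplets are enough to rebuild the matrix
M_rebuilt <- sparseMatrix(i = s$i, j = s$j, x = s$x, dims = dim(M))
```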
Upvotes: 3
Reputation: 691
For my purposes (very sparse matrix with millions of rows, and tens of thousands of columns, more than 99.9% of the values empty) this was still much too slow. What worked was the code below - might be helpful to others as well:
merge.sparse = function(listMatrixes) {
  # takes a list of sparse matrixes with different columns and adds them row wise
  allColnames <- sort(unique(unlist(lapply(listMatrixes, colnames))))
  # local accumulator; exists("matrixToReturn") could pick up a global of the same name
  matrixToReturn <- NULL
  for (currentMatrix in listMatrixes) {
    newColLocations <- match(colnames(currentMatrix), allColnames)
    # != 0 rather than > 0, so negative entries stay aligned with currentMatrix@x
    indexes <- which(currentMatrix != 0, arr.ind = TRUE)
    newColumns <- newColLocations[indexes[, 2]]
    rows <- indexes[, 1]
    # nrow() rather than max(rows), so trailing all-zero rows are not dropped
    newMatrix <- sparseMatrix(i = rows, j = newColumns, x = currentMatrix@x,
                              dims = c(nrow(currentMatrix), length(allColnames)))
    if (is.null(matrixToReturn)) {
      matrixToReturn <- newMatrix
    } else {
      matrixToReturn <- rbind2(matrixToReturn, newMatrix)
    }
  }
  colnames(matrixToReturn) <- allColnames
  matrixToReturn
}
Upvotes: 6
Reputation: 1316
If one needs to combine/concatenate many small sparse matrices into one large sparse matrix, it's much better and more efficient to use a mapping of global and local row and column indices to construct a large sparse matrix. E.g.,
globalInds <- matrix(NA, nrow=dim(localPairRowColInds)[1], 2)
# extract the corresponding global row indices for the local row indices
globalInds[ , 1] <- globalRowInds[ localPairRowColInds[,1] ]
globalInds[ , 2] <- globalColInds[ localPairRowColInds[,2] ]
write.table(cbind(globalInds, localPairVals), file=dataFname, append = T, sep = " ", row.names = F, col.names = F)
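The snippet above depends on variables that are not shown (localPairRowColInds, globalRowInds, dataFname and so on), so here is one possible reading of the same idea as a self-contained sketch. It keeps everything in memory instead of appending to a file: each matrix's local indices are translated to global ones, the triplets are stacked, and sparseMatrix() is called once. The function name combine_by_global_index is hypothetical, and the sketch assumes every input matrix has both row and column names.

```r
library(Matrix)

# Hypothetical helper: merge named sparse matrices via global row/column indices
combine_by_global_index <- function(mats) {
  global_rows <- unique(unlist(lapply(mats, rownames)))
  global_cols <- unique(unlist(lapply(mats, colnames)))
  triplets <- lapply(mats, function(M) {
    s <- summary(M)  # local (i, j, x) triplets of this matrix
    data.frame(i = match(rownames(M), global_rows)[s$i],  # local -> global rows
               j = match(colnames(M), global_cols)[s$j],  # local -> global cols
               x = s$x)
  })
  tr <- do.call(rbind, triplets)
  # one construction call; repeated (i, j) pairs are summed by default
  sparseMatrix(i = tr$i, j = tr$j, x = tr$x,
               dims = c(length(global_rows), length(global_cols)),
               dimnames = list(global_rows, global_cols))
}

# Usage with the two tables from the question
A <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(1, 2), dims = c(1, 2),
                  dimnames = list("XXXX", c("AAAA", "BBBB")))
B <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(3, 4), dims = c(1, 2),
                  dimnames = list("YYYY", c("BBBB", "CCCC")))
C <- combine_by_global_index(list(A, B))
```

Building the result once avoids growing a matrix inside a loop, which is where most of the time tends to go with many small inputs.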
Upvotes: 0
Reputation: 31452
We can create an empty sparse Matrix that has all the rows and columns, then insert the values into it using subset assignment:
my.bind = function(A, B){
C = Matrix(0, nrow = NROW(A) + NROW(B), ncol = length(union(colnames(A), colnames(B))),
dimnames = list(c(rownames(A), rownames(B)), union(colnames(A), colnames(B))))
C[rownames(A), colnames(A)] = A
C[rownames(B), colnames(B)] = B
return(C)
}
my.bind(A,B)
# 2 x 3 sparse Matrix of class "dgCMatrix"
# AAAA BBBB CCCC
# XXXX 1 2 .
# YYYY . 3 4
Note that the above assumes that A and B do not share row names. If there are shared row names, then you should use row numbers instead of names for the assignment.
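A minimal sketch of that row-number variant, assuming the same approach as my.bind but addressing rows by position so that duplicated row names are not a problem. The name bind_by_position is made up for this example, and the test data deliberately gives both matrices the same row name:

```r
library(Matrix)

# Hypothetical variant of my.bind: rows are assigned by position, columns by name
bind_by_position <- function(A, B) {
  all_cols <- union(colnames(A), colnames(B))
  C <- Matrix(0, nrow = nrow(A) + nrow(B), ncol = length(all_cols),
              dimnames = list(c(rownames(A), rownames(B)), all_cols),
              sparse = TRUE)
  C[seq_len(nrow(A)), colnames(A)] <- A            # first block of rows
  C[nrow(A) + seq_len(nrow(B)), colnames(B)] <- B  # rows after A's
  C
}

# Both inputs share the row name "XXXX"
A2 <- Matrix(c(1, 2), 1, dimnames = list("XXXX", c("AAAA", "BBBB")))
B2 <- Matrix(c(3, 4), 1, dimnames = list("XXXX", c("BBBB", "CCCC")))
C2 <- bind_by_position(A2, B2)
```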
The data:
library(Matrix)
A = Matrix(c(1,2), 1, dimnames = list('XXXX', c('AAAA','BBBB')))
B = Matrix(c(3,4), 1, dimnames = list('YYYY', c('BBBB','CCCC')))
Upvotes: 0
Reputation: 345
It looks like it's necessary to add empty columns (columns of 0s) to the matrices to make them compatible for an rbind (matrices with the same column names, in the same order). The following code does it:
library(Matrix)
# dummy data
set.seed(3344)
A = Matrix(matrix(rbinom(16, 2, 0.2), 4))
colnames(A)=letters[1:4]
B = Matrix(matrix(rbinom(9, 2, 0.2), 3))
colnames(B) = letters[3:5]
# finding what's missing
misA = colnames(B)[!colnames(B) %in% colnames(A)]
misB = colnames(A)[!colnames(A) %in% colnames(B)]
misAl = as.vector(numeric(length(misA)), "list")
names(misAl) = misA
misBl = as.vector(numeric(length(misB)), "list")
names(misBl) = misB
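## adding missing columns to initial matrices
## (A and B are wrapped in list() so that c() builds an argument list for cbind,
##  without depending on how c() treats a Matrix object)
An = do.call(cbind, c(list(A), misAl))
Bn = do.call(cbind, c(list(B), misBl))[,colnames(An)]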
# final bind
rbind(An, Bn)
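The same pad-then-rbind idea can be wrapped in a small helper when there are more than two matrices. This is only a sketch: pad_cols is a hypothetical name, and the missing columns are added as an empty sparse block rather than as scalar zeros.

```r
library(Matrix)

# Hypothetical helper: give M exactly the columns in all_cols, in that order
pad_cols <- function(M, all_cols) {
  missing_cols <- setdiff(all_cols, colnames(M))
  if (length(missing_cols) > 0) {
    # empty sparse block holding the columns M does not have yet
    Z <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                      dims = c(nrow(M), length(missing_cols)),
                      dimnames = list(NULL, missing_cols))
    M <- cbind(M, Z)
  }
  M[, all_cols, drop = FALSE]
}

# Usage with two small made-up matrices (columns a..d and c..e)
P <- sparseMatrix(i = c(1, 2), j = c(1, 4), x = c(1, 2), dims = c(2, 4),
                  dimnames = list(NULL, letters[1:4]))
Q <- sparseMatrix(i = 1, j = 3, x = 5, dims = c(1, 3),
                  dimnames = list(NULL, letters[3:5]))
mats <- list(P, Q)
all_cols <- unique(unlist(lapply(mats, colnames)))
R <- do.call(rbind, lapply(mats, pad_cols, all_cols = all_cols))
```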
Upvotes: 4