Reputation: 47
I have a large sparse matrix which I need to correlate which haven't been possible for me because:
bigstats
and bigmemory
, my R froze over (using a windows 10, 8GB laptop)Matrix
packageN.B: I am trying to correlate a sparse matrix in the format 'M'X + X'M - M'M'
which isn't possible which is why I am trying to split the sparse matrix into two or three, turn to dense matrix using as.matrix()
then correlate using the cor()
then combine the correlated results into one using cbind()
Now:
I want to ask if it's possible to split a sparse matrix into two or three parts, convert to dense matrices then correlate each dense matrix then cbind the two or three dense matrices into one then export to a text file.
What function can I use to split a sparse matrix into two or three bearing in mind that both the i
and p
parts of the sparse matrix are equal sizes with the same dim
Formal class 'dgCMatrix' [package "Matrix"] with 7 slots
..@ i : int [1:73075722] ...
..@ p : int [1:73075722] 0 0 1 1 1 1 1 2 2 2 ...
..@ Dim : int [1:2] 500232 500232
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..@ x : num [1:73075722] ...
..@ uplo : chr "L"
..@ factors : list()
The correlation output will be in this format:
[,1] [,2] [,3] [,4]
[1,] 1.00000000 -0.8343860 0.3612926 0.09678096
[2,] -0.83438600 1.0000000 -0.8154071 0.24611830
[3,] 0.36129256 -0.8154071 1.0000000 -0.51801346
[4,] 0.09678096 0.2461183 -0.5180135 1.00000000
[5,] 0.67411584 -0.3560782 -0.1056124 0.60987601
[6,] 0.23071712 -0.4457467 0.5117711 0.21848068
[7,] 0.49200080 -0.4246502 0.2016633 0.46971736
[,5] [,6] [,7]
[1,] 0.6741158 0.2307171 0.4920008
[2,] -0.3560782 -0.4457467 -0.4246502
[3,] -0.1056124 0.5117711 0.2016633
[4,] 0.6098760 0.2184807 0.4697174
[5,] 1.0000000 0.2007979 0.7198228
[6,] 0.2007979 1.0000000 0.6965899
[7,] 0.7198228 0.6965899 1.0000000
Upvotes: 0
Views: 631
Reputation: 11
I implement dicing of a sparse_matrix by row indexing:
as.DF_spM <- function(data.use,chun_size="20000000",dg_class="dgCMatrix") {
lapply( split(seq(nrow(data.use)), (seq(nrow(data.use))-1) %/% as.numeric(chun_size) ) , function(nx) {
switch(dg_class,
dgTMatrix = {as(as.matrix(data.use[nx,]),"dgTMatrix")},
dgCMatrix = {as(as.matrix(data.use[nx,]),"dgCMatrix")},
dgRMatrix = {as(as.matrix(data.use[nx,]),"dgRMatrix")}
)
}) %>% Matrix.utils::rBind.fill()
}
Upvotes: 0
Reputation: 11255
The cor()
function is a way to efficiently calculate the pearson
correlation between all of the columns. cor
is very efficient but if we do not mind a hit in efficiency, we can calculate the pearson
correlation manually:
n_row <- nrow(res)
cor_mat <- Matrix(0L,n_row, ncol(res))
cMeans <- Matrix::colMeans(res)
for (i in seq_len(nrow(res)-1)){
x_delta = res[, i] - cMeans[i]
sum_x_delta_sq = sum(x_delta^2)
for (j in (i+1):nrow(res)){
y_delta = res[, j] - cMeans[j]
tmp <- sum(x_delta * y_delta) / sqrt((sum_x_delta_sq * sum(y_delta^2)))
if (abs(tmp) > 0.05) cor_mat[i, j] <- tmp
}
}
As sparsity of a matrix increases, we can make the above more complicated but more performant by separating operations involving non-sparse elements and operations involving sparse elements:
n_row <- nrow(res)
cor_mat <- Matrix(0L,n_row, ncol(res))
cMeans <- Matrix::colMeans(res)
for (i in 1){
sp_i <- res[, i, drop = F]
i_s <- sp_i@i + 1
i_zero_delta = 0 - cMeans[i]
i_non_zero_delta = sp_i@x - cMeans[i]
sum_x_delta_sq = sum(c((n_row - length(i_s)) * i_zero_delta^2, i_non_zero_delta^2))
for (j in (i+1):nrow(res)){
sp_j <- res[, j, drop = F]
j_s <- sp_j@i + 1
j_zero_delta = 0L - cMeans[j]
j_non_zero_delta = sp_j@x - cMeans[j]
sum_y_delta_sq = sum(c((n_row - length(j_s)) * j_zero_delta^2, j_non_zero_delta^2))
common <- intersect(i_s, j_s)
only_i <- setdiff(i_s, j_s)
only_j <- setdiff(j_s, i_s)
none <- n_row - length(c(common, only_i, only_j))
numerator = 0
if (length(common) > 0) numerator = numerator + sum(i_non_zero_delta[match(common, i_s)] * j_non_zero_delta[match(common, j_s)])
if (length(only_i) > 0) numerator = numerator + j_zero_delta * sum(i_non_zero_delta[match(only_i, i_s)])
if (length(only_j) > 0) numerator = numerator + i_zero_delta * sum(j_non_zero_delta[match(only_j, j_s)])
if (length(none) > 0) numerator = numerator + i_zero_delta * j_zero_delta * none
tmp <- numerator / sqrt(sum_x_delta_sq * sum_y_delta_sq)
if (abs(tmp) > 0.05) cor_mat[i, j] <- tmp
}
}
Upvotes: 1