Reputation: 47

How to split a sparse matrix into two

I have a large sparse matrix which I need to correlate which haven't been possible for me because:

I cant convert the sparse matrix to a dense matrix due to R's memory limitation
I tried using packages bigstats and bigmemory, my R froze over (using a windows 10, 8GB laptop)
There's no correlation function in R's Matrix package

N.B: I am trying to correlate a sparse matrix in the format 'M'X + X'M - M'M' which isn't possible which is why I am trying to split the sparse matrix into two or three, turn to dense matrix using as.matrix() then correlate using the cor() then combine the correlated results into one using cbind()

Now:

I want to ask if it's possible to split a sparse matrix into two or three parts, convert to dense matrices then correlate each dense matrix then cbind the two or three dense matrices into one then export to a text file.

What function can I use to split a sparse matrix into two or three bearing in mind that both the i and p parts of the sparse matrix are equal sizes with the same dim

Formal class 'dgCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:73075722]  ...
  ..@ p       : int [1:73075722] 0 0 1 1 1 1 1 2 2 2 ...
  ..@ Dim     : int [1:2] 500232 500232
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:73075722]  ...
  ..@ uplo    : chr "L"
  ..@ factors : list()

The correlation output will be in this format:

         [,1]       [,2]       [,3]        [,4]
[1,]  1.00000000 -0.8343860  0.3612926  0.09678096
[2,] -0.83438600  1.0000000 -0.8154071  0.24611830
[3,]  0.36129256 -0.8154071  1.0000000 -0.51801346
[4,]  0.09678096  0.2461183 -0.5180135  1.00000000
[5,]  0.67411584 -0.3560782 -0.1056124  0.60987601
[6,]  0.23071712 -0.4457467  0.5117711  0.21848068
[7,]  0.49200080 -0.4246502  0.2016633  0.46971736
           [,5]       [,6]       [,7]
[1,]  0.6741158  0.2307171  0.4920008
[2,] -0.3560782 -0.4457467 -0.4246502
[3,] -0.1056124  0.5117711  0.2016633
[4,]  0.6098760  0.2184807  0.4697174
[5,]  1.0000000  0.2007979  0.7198228
[6,]  0.2007979  1.0000000  0.6965899
[7,]  0.7198228  0.6965899  1.0000000

Upvotes: 0

Answers (2)

Nh_code

Reputation: 11

I implement dicing of a sparse_matrix by row indexing:

as.DF_spM <- function(data.use,chun_size="20000000",dg_class="dgCMatrix") {
    lapply(  split(seq(nrow(data.use)), (seq(nrow(data.use))-1) %/% as.numeric(chun_size) )  , function(nx) {
        switch(dg_class,
               dgTMatrix = {as(as.matrix(data.use[nx,]),"dgTMatrix")},
               dgCMatrix = {as(as.matrix(data.use[nx,]),"dgCMatrix")},
               dgRMatrix = {as(as.matrix(data.use[nx,]),"dgRMatrix")}
               )
        }) %>% Matrix.utils::rBind.fill()
}

Upvotes: 0

Cole

Reputation: 11255

The cor() function is a way to efficiently calculate the pearson correlation between all of the columns. cor is very efficient but if we do not mind a hit in efficiency, we can calculate the pearson correlation manually:

n_row <- nrow(res)
cor_mat <- Matrix(0L,n_row, ncol(res))
cMeans <- Matrix::colMeans(res)

for (i in seq_len(nrow(res)-1)){
  x_delta = res[, i] - cMeans[i]
  sum_x_delta_sq = sum(x_delta^2)

  for (j in (i+1):nrow(res)){
    y_delta = res[, j] - cMeans[j]

    tmp <- sum(x_delta * y_delta) / sqrt((sum_x_delta_sq * sum(y_delta^2)))
    if (abs(tmp) > 0.05) cor_mat[i, j] <- tmp
  }
}

As sparsity of a matrix increases, we can make the above more complicated but more performant by separating operations involving non-sparse elements and operations involving sparse elements:

n_row <- nrow(res)

cor_mat <- Matrix(0L,n_row, ncol(res))
cMeans <- Matrix::colMeans(res)

for (i in 1){

  sp_i <- res[, i, drop = F]
  i_s <- sp_i@i + 1

  i_zero_delta = 0 - cMeans[i]
  i_non_zero_delta = sp_i@x - cMeans[i]

  sum_x_delta_sq = sum(c((n_row - length(i_s)) * i_zero_delta^2, i_non_zero_delta^2))

  for (j in (i+1):nrow(res)){

    sp_j <- res[, j, drop = F]
    j_s <- sp_j@i + 1

    j_zero_delta = 0L - cMeans[j]
    j_non_zero_delta = sp_j@x - cMeans[j]

    sum_y_delta_sq = sum(c((n_row - length(j_s)) * j_zero_delta^2, j_non_zero_delta^2))

    common <- intersect(i_s, j_s)
    only_i <- setdiff(i_s, j_s)
    only_j <- setdiff(j_s, i_s)
    none <- n_row - length(c(common, only_i, only_j))

    numerator = 0

    if (length(common) > 0) numerator = numerator + sum(i_non_zero_delta[match(common, i_s)] * j_non_zero_delta[match(common, j_s)])
    if (length(only_i) > 0) numerator = numerator + j_zero_delta * sum(i_non_zero_delta[match(only_i, i_s)])
    if (length(only_j) > 0) numerator = numerator + i_zero_delta * sum(j_non_zero_delta[match(only_j, j_s)])
    if (length(none) > 0)   numerator = numerator + i_zero_delta * j_zero_delta * none

    tmp <- numerator / sqrt(sum_x_delta_sq * sum_y_delta_sq)
    if (abs(tmp) > 0.05) cor_mat[i, j] <- tmp
  }
}

Upvotes: 1

How to split a sparse matrix into two

Answers (2)

Related Questions