Ryan C. Thompson
Ryan C. Thompson

Reputation: 42030

Computing row sums of a big.matrix in R?

I have a big matrix with about 60 million rows and 150 colums (roughly 9 billion elements total). I have stored this data in a big.matrix object (from package bigmemory). Now, I wish to compute the sum of each row, which is a problem because big.matrix is column-oriented, so as far as I can tell all the summary functions are column oriented (e.g. colsum, colmax, etc.) and there is no function available by default for computing row sums. Of course I can do apply(x, 1, sum), but this will take a very long time. I can also loop over the columns one by one and use vectorized addition to add them:

mysum <- rep(0, nrow(x))
for (i in seq(ncol(x))) 
  mysum <- mysum + x[,i]

but this still takes over 20 minutes, and is obviously suboptimal since it is creating a new 60-million-element vector each time through the loop. It seems like there must be some faster way to do this.

Edit

I got this down to 10 minutes by processing chunks of a million or so rows at a time, and calling rowSums on those, and then concatenating the results. I'd still be interested to know if there is an optimized way to do this, though.

Upvotes: 2

Views: 1534

Answers (1)

Scott Ritchie
Scott Ritchie

Reputation: 10543

I've written some C++ code to do this, adapted from the bigmemory Rcpp gallery:

rowSums.cpp

// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::depends(BH, bigmemory)]]
#include <bigmemory/MatrixAccessor.hpp>

#include <numeric>

// Logic for BigRowSums.
template <typename T>
NumericVector BigRowSums(XPtr<BigMatrix> pMat, MatrixAccessor<T> mat) {
    NumericVector rowSums(pMat->nrow(), 0.0);
    NumericVector value(1);
    for (int jj = 0; jj < pMat->ncol(); jj++) {
      for (int ii = 0; ii < pMat->nrow(); ii++) {
        value = mat[jj][ii];
        if (all(!is_na(value))) {
          rowSums[ii] += value[0];
        }   
      }   
    }   
    return rowSums;
}

// Dispatch function for BigRowSums
//
// [[Rcpp::export]]
NumericVector BigRowSums(SEXP pBigMat) {
    XPtr<BigMatrix> xpMat(pBigMat);

    switch(xpMat->matrix_type()) {
      case 1:
        return BigRowSums(xpMat, MatrixAccessor<char>(*xpMat));
      case 2:
        return BigRowSums(xpMat, MatrixAccessor<short>(*xpMat));
      case 4:
        return BigRowSums(xpMat, MatrixAccessor<int>(*xpMat));
      case 6:
        return BigRowSums(xpMat, MatrixAccessor<float>(*xpMat));
      case 8:
        return BigRowSums(xpMat, MatrixAccessor<double>(*xpMat));
      default:
        throw Rcpp::exception("unknown type detected for big.matrix object!");
    }   
}

In R:

library(bigmemory)
library(Rcpp)
sourceCpp("rowSums.cpp")

m <- as.big.matrix(matrix(1:9, 3))
BigRowSums(m@address)
[1] 12 15 18

Upvotes: 2

Related Questions