Efficient way of row/column sum of a IndexedRowmatrix in Apache Spark

Question

I have a matrix in a CoordinateMatrix format in Scala. The Matrix is sparse and the entires look like (upon coo_matrix.entries.collect),

Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(
  MatrixEntry(0,0,-1.0), MatrixEntry(0,1,-1.0), MatrixEntry(1,0,-1.0),
  MatrixEntry(1,1,-1.0), MatrixEntry(1,2,-1.0), MatrixEntry(2,1,-1.0), 
  MatrixEntry(2,2,-1.0), MatrixEntry(0,3,-1.0), MatrixEntry(0,4,-1.0), 
  MatrixEntry(0,5,-1.0), MatrixEntry(3,0,-1.0), MatrixEntry(4,0,-1.0), 
  MatrixEntry(3,3,-1.0), MatrixEntry(3,4,-1.0), MatrixEntry(4,3,-1.0),
  MatrixEntry(4,4,-1.0))

This is only a small sample size. The Matrix is of size a N x N (where N = 1 million) though a majority of it is sparse. What is one of the efficient way of getting row sums of this matrix in Spark Scala? The goal is to create a new RDD composed of row sums i.e. of size N where 1st element is row sum of row1 and so on ..

I can always convert this coordinateMatrix to IndexedRowMatrix and run a for loop to compute rowsums one iteration at a time, but it is not the most efficient approach.

any idea is greatly appreciated.

Efficient way of row/column sum of a IndexedRowmatrix in Apache Spark

Answers (1)

Related Questions