Reputation: 133
I have a matrix in CoordinateMatrix format in Scala. The matrix is sparse, and the entries look like this (from coo_matrix.entries.collect):
Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(
MatrixEntry(0,0,-1.0), MatrixEntry(0,1,-1.0), MatrixEntry(1,0,-1.0),
MatrixEntry(1,1,-1.0), MatrixEntry(1,2,-1.0), MatrixEntry(2,1,-1.0),
MatrixEntry(2,2,-1.0), MatrixEntry(0,3,-1.0), MatrixEntry(0,4,-1.0),
MatrixEntry(0,5,-1.0), MatrixEntry(3,0,-1.0), MatrixEntry(4,0,-1.0),
MatrixEntry(3,3,-1.0), MatrixEntry(3,4,-1.0), MatrixEntry(4,3,-1.0),
MatrixEntry(4,4,-1.0))
This is only a small sample. The matrix is of size N x N (where N = 1 million), though the majority of it is sparse. What is an efficient way of getting the row sums of this matrix in Spark Scala? The goal is to create a new RDD of size N composed of the row sums, where the 1st element is the row sum of row 1, and so on.
I can always convert this CoordinateMatrix to an IndexedRowMatrix and run a for loop to compute the row sums one iteration at a time, but that is not the most efficient approach.
Any idea is greatly appreciated.
Upvotes: 1
Views: 2431
Reputation: 330193
It will be quite expensive due to shuffling (this is the part you cannot really avoid here), but you can convert the entries to a PairRDD and reduce by key:
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD

val mat: CoordinateMatrix = ???

// Map each entry to a (row, value) pair and sum the values per row index
val rowSums: RDD[(Long, Double)] = mat.entries
  .map { case MatrixEntry(row, _, value) => (row, value) }
  .reduceByKey(_ + _)
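For a quick sanity check, here is a minimal self-contained sketch (assuming a live SparkContext named sc and a few of the sample entries from the question):

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Build a tiny CoordinateMatrix from a handful of entries
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, -1.0), MatrixEntry(0, 1, -1.0),
  MatrixEntry(1, 0, -1.0), MatrixEntry(1, 1, -1.0), MatrixEntry(1, 2, -1.0)
))
val mat = new CoordinateMatrix(entries)

mat.entries
  .map { case MatrixEntry(row, _, value) => (row, value) }
  .reduceByKey(_ + _)
  .collect()
// e.g. Array((0,-2.0), (1,-3.0)) -- ordering of the pairs is not guaranteed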
Unlike a solution based on IndexedRowMatrix:
import org.apache.spark.mllib.linalg.distributed.IndexedRow

// Convert to an IndexedRowMatrix and sum each row vector
mat.toIndexedRowMatrix.rows.map {
  case IndexedRow(i, values) => (i, values.toArray.sum)
}
it requires no groupBy transformation or intermediate SparseVectors.
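Note that in both cases rows without any explicit entries are simply absent from the result, so it can contain fewer than N elements. If a dense result of size N is required, one possible sketch is to left-join the full index range against the row sums (assuming rowSums as defined above, with N taken from mat.numRows):

val n = mat.numRows  // N, derived here from the largest row index in the entries

// Pair every index in [0, N) with 0.0, then pull in the computed sums;
// rows with no explicit entries keep the default 0.0
val denseRowSums = sc.parallelize(0L until n)
  .map(i => (i, 0.0))
  .leftOuterJoin(rowSums)
  .mapValues { case (default, sum) => sum.getOrElse(default) }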
Upvotes: 4