Reputation: 335
I created a CoordinateMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries = sc.parallelize(Seq(
  MatrixEntry(0, 1, 1), MatrixEntry(0, 2, 2), MatrixEntry(0, 3, 3),
  MatrixEntry(0, 4, 4), MatrixEntry(2, 3, 5), MatrixEntry(2, 4, 6),
  MatrixEntry(3, 4, 7)))
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
which is
0 1 2 3 4
0 0 0 0 0
0 0 0 5 6
0 0 0 0 7
Now I want to print this matrix. I first convert it to an IndexedRowMatrix (row order matters to me, and I cannot lose any row of the matrix):
scala> mat.toIndexedRowMatrix.rows.collect.sortBy(_.index)
res8: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] =
Array(IndexedRow(0,(5,[1,2,3,4],[1.0,2.0,3.0,4.0])), IndexedRow(2,(5,[3,4],[5.0,6.0])), IndexedRow(3,(5,[4],[7.0])))
But the second row (index 1) is missing from this result because all of its entries are 0, so I cannot go on to print the matrix or convert it to Array[Array[Double]]. I don't know how to deal with this. Thank you.
Upvotes: 1
Views: 922
Reputation: 330413
In general, if you need a distributed matrix, then collecting and printing it is simply not an option. Still, you can convert your data to a BlockMatrix and collect it as a local DenseMatrix as follows:
mat.toBlockMatrix.toLocalMatrix
// res1: org.apache.spark.mllib.linalg.Matrix =
// 0.0 1.0 2.0 3.0 4.0
// 0.0 0.0 0.0 0.0 0.0
// 0.0 0.0 0.0 5.0 6.0
// 0.0 0.0 0.0 0.0 7.0
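
Since the question also asks for an Array[Array[Double]], here is a minimal sketch of one way to copy the local matrix into row arrays. The helper name toRowArrays is hypothetical and not part of Spark. To keep the sketch runnable without a SparkContext, the local matrix is rebuilt by hand with Matrices.dense as a stand-in for the result of mat.toBlockMatrix.toLocalMatrix; in a real session you would pass that result directly.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Hypothetical helper (not a Spark API): copy a local MLlib Matrix into
// row-major arrays. Matrix.apply(i, j) hides the column-major storage.
def toRowArrays(m: Matrix): Array[Array[Double]] =
  Array.tabulate(m.numRows, m.numCols)((i, j) => m(i, j))

// Assumption: stand-in for mat.toBlockMatrix.toLocalMatrix, built locally.
// Matrices.dense expects the values in column-major order.
val local: Matrix = Matrices.dense(4, 5, Array(
  0.0, 0.0, 0.0, 0.0,   // column 0
  1.0, 0.0, 0.0, 0.0,   // column 1
  2.0, 0.0, 0.0, 0.0,   // column 2
  3.0, 0.0, 5.0, 0.0,   // column 3
  4.0, 0.0, 6.0, 7.0))  // column 4

val rows: Array[Array[Double]] = toRowArrays(local)

// One line per row; the all-zero row at index 1 is kept.
rows.foreach(r => println(r.mkString(" ")))
```

Like toLocalMatrix itself, this materializes every entry, zeros included, on the driver, so it is only reasonable for matrices small enough to fit in driver memory.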
Upvotes: 2