Reputation: 335
In spark-shell, I created a CoordinateMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries = sc.parallelize(Seq(
Array(0, 1, 1), Array(0, 2, 2), Array(0, 3, 3),
Array(0, 4, 4), Array(1, 2, 5), Array(1, 3, 6),
Array(1, 4, 7), Array(2, 3, 8), Array(2, 4, 9),
Array(3, 4, 10))).map(f => MatrixEntry(f(0), f(1), f(2)))
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
which is:
0 1 2 3 4
0 0 5 6 7
0 0 0 8 9
0 0 0 0 10
Now I want to convert it to RowMatrix and see the entries:
scala> mat.toRowMatrix.rows.collect
res1: Array[org.apache.spark.mllib.linalg.Vector] = Array((5,[1,2,3,4],[1.0,2.0,3.0,4.0]), (5,[2,3,4],[5.0,6.0,7.0]), (5,[4],[10.0]), (5,[3,4],[8.0,9.0]))
It is strange that the third and fourth rows are swapped in the RowMatrix. What is causing this? Thanks.
Upvotes: 2
Views: 543
Reputation: 330413
It is not strange. As you can read in the API docs, RowMatrix:
Represents a row-oriented distributed Matrix with no meaningful row indices.
Moreover, converting a CoordinateMatrix to any other type of distributed matrix requires repartitioning. The order of the output rows / blocks depends partially on the number of partitions and the dimensions of the matrix, but beyond that it is not deterministic.
If the order of rows is important, you should use IndexedRowMatrix. It still doesn't guarantee the order of the rows, but IndexedRow preserves the indices, which can be used to reorder the rows if necessary.
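As a minimal sketch of that reordering, assuming a spark-shell session where sc is already defined and the same entries as in the question (the name sortedRows is only illustrative), you could convert to an IndexedRowMatrix and sort by the preserved index before collecting:

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, MatrixEntry}

// Assumption: running inside spark-shell, so `sc` (SparkContext) already exists.
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 1, 1.0), MatrixEntry(0, 2, 2.0), MatrixEntry(0, 3, 3.0),
  MatrixEntry(0, 4, 4.0), MatrixEntry(1, 2, 5.0), MatrixEntry(1, 3, 6.0),
  MatrixEntry(1, 4, 7.0), MatrixEntry(2, 3, 8.0), MatrixEntry(2, 4, 9.0),
  MatrixEntry(3, 4, 10.0)))
val mat = new CoordinateMatrix(entries)

// toIndexedRowMatrix keeps each row's index next to its vector,
// so the rows can be sorted explicitly instead of relying on partition order.
val sortedRows: Array[IndexedRow] = mat.toIndexedRowMatrix.rows
  .sortBy(_.index)
  .collect()

// Print each row together with its original index.
sortedRows.foreach(r => println(s"${r.index}: ${r.vector}"))
```

The sort is done on the RDD before collect, so the same approach should also work when you keep the data distributed and only need a consistent order downstream.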
Upvotes: 1