Xiaoyu Chen

Reputation: 335

Converting CoordinateMatrix to RowMatrix doesn't preserve row order

In Spark-shell, I created a CoordinateMatrix:

import org.apache.spark.mllib.linalg.distributed.{
  CoordinateMatrix, MatrixEntry}
val entries = sc.parallelize(Seq(
  Array(0, 1, 1), Array(0, 2, 2), Array(0, 3, 3), 
  Array(0, 4, 4), Array(1, 2, 5), Array(1, 3, 6),
  Array(1, 4, 7), Array(2, 3, 8), Array(2, 4, 9),
  Array(3, 4, 10))).map(f => MatrixEntry(f(0), f(1), f(2)))

val mat: CoordinateMatrix = new CoordinateMatrix(entries)

which is:

0 1 2 3 4
0 0 5 6 7
0 0 0 8 9
0 0 0 0 10

Now I want to convert it to a RowMatrix and inspect the rows:

scala> mat.toRowMatrix.rows.collect
res1: Array[org.apache.spark.mllib.linalg.Vector] = Array((5,[1,2,3,4],[1.0,2.0,3.0,4.0]), (5,[2,3,4],[5.0,6.0,7.0]), (5,[4],[10.0]), (5,[3,4],[8.0,9.0]))

Strangely, the third and fourth rows are swapped in the RowMatrix. Why does this happen? Thanks.

Upvotes: 2

Views: 543

Answers (1)

zero323

Reputation: 330413

It is not strange. As you can read in the API docs, RowMatrix:

Represents a row-oriented distributed Matrix with no meaningful row indices.

Moreover, converting a CoordinateMatrix to any other type of distributed matrix requires repartitioning. The order of the output rows or blocks depends partly on the number of partitions and the dimensions of the matrix, but beyond that it is not deterministic.

If the order of rows is important, you should use IndexedRowMatrix. It still doesn't guarantee the order of the rows, but each IndexedRow preserves its index, which can be used to reorder the rows if necessary.
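
As a rough illustration of reordering by index, here is a minimal sketch built on the matrix from the question. The names `ctx` and `orderedRows` are mine, not from the original post, and the local master setting is an assumption that only applies when no context exists yet (inside spark-shell the existing one should be reused).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, MatrixEntry}

// Reuse the running context if there is one (e.g. in spark-shell);
// otherwise start a local one. The local[2] master is an assumption.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("ordered-rows-sketch"))

// Same entries as in the question, written as (row, column, value).
val entries = ctx.parallelize(Seq(
  (0L, 1L, 1.0), (0L, 2L, 2.0), (0L, 3L, 3.0), (0L, 4L, 4.0),
  (1L, 2L, 5.0), (1L, 3L, 6.0), (1L, 4L, 7.0),
  (2L, 3L, 8.0), (2L, 4L, 9.0),
  (3L, 4L, 10.0)
)).map { case (i, j, v) => MatrixEntry(i, j, v) }

val mat = new CoordinateMatrix(entries)

// Each IndexedRow carries its original row index, so sorting on it
// restores the intended order regardless of how rows were partitioned.
val orderedRows: Array[IndexedRow] =
  mat.toIndexedRowMatrix().rows.sortBy(_.index).collect()

orderedRows.foreach(r => println(s"${r.index}: ${r.vector}"))
```

The sort happens before `collect`, so the same `sortBy(_.index)` step can also be used on its own when the rows should stay distributed.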

Upvotes: 1
