Leszek Malinowski

Reputation: 111

Is it possible to correctly calculate SVD on IndexedRowMatrix in Spark?

I've got an IndexedRowMatrix [m x n] that contains only X non-zero rows. I'm setting k = 3.

When I try to calculate the SVD of this matrix with computeU set to true, the dimensions of the U matrix are [m x n], when the correct dimensions should be [m x k].

Why does this happen?

I've already tried converting the IndexedRowMatrix to a RowMatrix and then calculating the SVD. The resulting dimensions are [X x k], so it only computes the result for the non-zero rows (the conversion drops the row indices, as described in the documentation).

Is it possible to convert this matrix while keeping the row indices?

    import au.com.bytecode.opencsv.CSVParser  // opencsv parser used below
    import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix, MatrixEntry, RowMatrix}

    val csv = sc.textFile("hdfs://spark/nlp/merged_sparse.csv").cache()  // original file

    val data = csv.mapPartitions(lines => {
        val parser = new CSVParser(' ')
        lines.map(line => {
          parser.parseLine(line)
        })
      }).map(line => {
        MatrixEntry(line(0).toLong - 1, line(1).toLong - 1, line(2).toDouble)
      }
    )

    val coordinateMatrix: CoordinateMatrix = new CoordinateMatrix(data)
    val indexedRowMatrix: IndexedRowMatrix = coordinateMatrix.toIndexedRowMatrix()
    val rowMatrix: RowMatrix = indexedRowMatrix.toRowMatrix()


    val svd: SingularValueDecomposition[RowMatrix, Matrix] = rowMatrix.computeSVD(3, computeU = true, 1e-9)

    val U: RowMatrix = svd.U // The U factor is a RowMatrix.
    val S: Vector = svd.s // The singular values are stored in a local dense vector.
    val V: Matrix = svd.V // The V factor is a local dense matrix.

    val indexedSvd: SingularValueDecomposition[IndexedRowMatrix, Matrix] = indexedRowMatrix.computeSVD(3, computeU = true, 1e-9)

    val indexedU: IndexedRowMatrix = indexedSvd.U // The U factor is an IndexedRowMatrix.
    val indexedS: Vector = indexedSvd.s // The singular values are stored in a local dense vector.
    val indexedV: Matrix = indexedSvd.V // The V factor is a local dense matrix.
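Independently of the dimension bug, the idea behind keeping row indices through a conversion like toRowMatrix is simply to carry an (index, row) pair along with every per-row transformation. A minimal plain-Scala sketch of that idea (no Spark; SimpleIndexedRow is a hypothetical stand-in for MLlib's IndexedRow):

```scala
// Plain-Scala sketch (no Spark): a stand-in for IndexedRow, to show how
// indices can be carried through a per-row transformation.
case class SimpleIndexedRow(index: Long, vector: Array[Double])

val rows = Seq(
  SimpleIndexedRow(0L, Array(1.0, 0.0)),
  SimpleIndexedRow(5L, Array(0.0, 2.0)) // rows 1-4 are all-zero and absent
)

// Converting to a bare row list (what dropping indices amounts to)
// loses the positions of the non-zero rows...
val plain = rows.map(_.vector)

// ...so instead keep (index, row) pairs through the transformation:
val transformed = rows.map(r => (r.index, r.vector.map(_ * 2.0)))
```

The same pairing is what MLlib itself does internally when it zips the saved indices back onto the rows of U, so any row-wise workaround can follow this pattern.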

Upvotes: 1

Views: 642

Answers (1)

Noah

Reputation: 13959

It looks like this is a bug in Spark MLlib: if you get the size of a row vector in your indexed matrix, it correctly returns 3 columns:

indexedU.rows.first().vector.size

I looked at the source, and it looks like they're incorrectly copying the column count from the original indexed matrix instead of using k:

val U = if (computeU) {
  val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
    IndexedRow(i, v)
  }
  new IndexedRowMatrix(indexedRows, nRows, nCols) // nCols (the original column count) is incorrect here; it should be k
} else {
  null
}

Looks like a prime candidate for a bugfix/pull request.
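The mismatch is purely in the declared metadata, which is easy to illustrate without Spark. A minimal sketch of the bug pattern (Mat is a hypothetical wrapper, not a Spark class): the row data already has k columns, but the wrapper is constructed with the original matrix's nCols.

```scala
// Minimal illustration of the bug pattern (hypothetical Mat wrapper, no
// Spark): the rows already have k entries each, but the declared column
// count is copied from the original matrix.
case class Mat(rows: Seq[Array[Double]], nRows: Long, nCols: Int)

val k = 3
val original = Mat(Seq(Array.fill(10)(1.0)), 1L, 10)

// After a rank-k SVD, each row of U has k entries...
val uRows = original.rows.map(_.take(k))

// ...but reusing original.nCols reports the wrong shape,
val buggyU = Mat(uRows, original.nRows, original.nCols) // nCols = 10, wrong
// while the fix is to declare k columns:
val fixedU = Mat(uRows, original.nRows, k) // nCols = 3
```

This is why `indexedU.rows.first().vector.size` returns 3 even though `indexedU.numCols()` reports the original column count: the per-row data is fine, only the stored dimension is stale.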

Upvotes: 1
