Reputation: 111
I've got an IndexedRowMatrix [m x n] that contains only X non-zero rows. I'm setting k = 3.
When I compute the SVD of this object with computeU set to true, the reported dimensions of the U matrix are [m x n], whereas the correct dimensions are [m x k].
Why does it happen?
I've already tried converting the IndexedRowMatrix to a RowMatrix and then calculating the SVD. The resulting dimensions are [X x k], so it only computes the result for the non-zero rows (the conversion drops the row indices, as stated in the documentation).
Is it possible to convert this matrix while keeping the row indices?
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix, MatrixEntry, RowMatrix}
// CSVParser: its import is not shown in the original (presumably opencsv).

val csv = sc.textFile("hdfs://spark/nlp/merged_sparse.csv").cache() // original file
val data = csv.mapPartitions(lines => {
  val parser = new CSVParser(' ') // one parser per partition
  lines.map(line => parser.parseLine(line))
}).map(line => {
  // Input indices are 1-based; MatrixEntry expects 0-based (row, column, value).
  MatrixEntry(line(0).toLong - 1, line(1).toLong - 1, line(2).toInt)
})

val coordinateMatrix: CoordinateMatrix = new CoordinateMatrix(data)
val indexedRowMatrix: IndexedRowMatrix = coordinateMatrix.toIndexedRowMatrix()
val rowMatrix: RowMatrix = indexedRowMatrix.toRowMatrix()

val svd: SingularValueDecomposition[RowMatrix, Matrix] = rowMatrix.computeSVD(3, computeU = true, 1e-9)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val S: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.

val indexedSvd: SingularValueDecomposition[IndexedRowMatrix, Matrix] = indexedRowMatrix.computeSVD(3, computeU = true, 1e-9)
val indexedU: IndexedRowMatrix = indexedSvd.U // The U factor is an IndexedRowMatrix.
val indexedS: Vector = indexedSvd.s // The singular values are stored in a local dense vector.
val indexedV: Matrix = indexedSvd.V // The V factor is a local dense matrix.
Upvotes: 1
Views: 642
Reputation: 13959
It looks like this is a bug in Spark MLlib. If you get the size of a row vector in your indexed matrix, it correctly returns 3 columns:
indexedU.rows.first().vector.size
I looked at the source, and it looks like they're incorrectly copying the number of columns from the original indexed matrix:
val U = if (computeU) {
  val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
    IndexedRow(i, v)
  }
  new IndexedRowMatrix(indexedRows, nRows, nCols) // nCols is incorrect here: U has k columns, not n
} else {
  null
}
Looks like a prime candidate for a bugfix/pull request.
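Until that is fixed, one possible workaround is to re-wrap the rows of U in a new IndexedRowMatrix with the column count stated explicitly. The sketch below is only an illustration under some assumptions: the object name RewrapU, the helper withCorrectShape and the tiny 6 x 4 input are made up for this example, and it runs against a local SparkContext. The column count is taken from svd.s.size rather than from the requested k, because computeSVD can return fewer than k singular values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

object RewrapU {
  // Hypothetical helper: reuse U's rows (indices included) but declare the
  // shape explicitly instead of trusting the dimensions reported by U.
  def withCorrectShape(u: IndexedRowMatrix, numRows: Long, numCols: Int): IndexedRowMatrix =
    new IndexedRowMatrix(u.rows, numRows, numCols)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rewrap-u").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up 6 x 4 input whose only non-zero rows are at indices 0, 2 and 5.
      val rows = sc.parallelize(Seq(
        IndexedRow(0L, Vectors.dense(1.0, 2.0, 0.0, 0.0)),
        IndexedRow(2L, Vectors.dense(0.0, 3.0, 4.0, 0.0)),
        IndexedRow(5L, Vectors.dense(5.0, 0.0, 0.0, 6.0))))
      val mat = new IndexedRowMatrix(rows, 6L, 4)

      val svd = mat.computeSVD(3, computeU = true, 1e-9)

      // Number of singular values actually returned = true column count of U.
      val fixedU = withCorrectShape(svd.U, mat.numRows(), svd.s.size)

      println(s"reported by svd.U: ${svd.U.numRows()} x ${svd.U.numCols()}")
      println(s"after re-wrapping: ${fixedU.numRows()} x ${fixedU.numCols()}")
      println(s"length of a row vector: ${fixedU.rows.first().vector.size}")
    } finally {
      sc.stop()
    }
  }
}
```

Note that U still only contains entries for the non-zero rows, but each one keeps its original index, so the all-zero rows are simply absent from the RDD rather than renumbered.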
Upvotes: 1