Reputation: 181
I have an input dataframe input_df
as:
+---------------+--------------------+
|Main_CustomerID| Vector|
+---------------+--------------------+
| 725153|[3.0,2.0,6.0,0.0,9.0|
| 873008|[4.0,1.0,0.0,1.0,...|
| 625109|[1.0,0.0,6.0,1.0,...|
| 817171|[0.0,4.0,0.0,7.0,...|
| 611498|[1.0,0.0,4.0,5.0,...|
+---------------+--------------------+
The input_df
is of schema type,
root
|-- Main_CustomerID: integer (nullable = true)
|-- Vector: vector (nullable = true)
By referring to Calculate Cosine Similarity Spark Dataframe, I have created the indexed row matrix and then I do:
val lm = irm.toIndexedRowMatrix.toBlockMatrix.toLocalMatrix
to find cosine similarity between columns. Now I have a resultant mllib
matrix,
cosineSimilarity: org.apache.spark.mllib.linalg.Matrix =
0.0 0.4199605255658081 0.5744269579035528 0.22075539284417395 0.561434614044346
0.0 0.0 0.2791452631195413 0.7259079527665503 0.6206918387272496
0.0 0.0 0.0 0.31792539222893695 0.6997167152675132
0.0 0.0 0.0 0.0 0.6776404124278828
0.0 0.0 0.0 0.0 0.0
Now, I need to convert this lm
which is of type org.apache.spark.mllib.linalg.Matrix
into a dataframe. I expect my output dataframe
to look as follows:
+---+------------------+------------------+-------------------+------------------+
| _1| _2| _3| _4| _5|
+---+------------------+------------------+-------------------+------------------+
|0.0|0.4199605255658081|0.5744269579035528|0.22075539284417395| 0.561434614044346|
|0.0| 0.0|0.2791452631195413| 0.7259079527665503|0.6206918387272496|
|0.0| 0.0| 0.0|0.31792539222893695|0.6997167152675132|
|0.0| 0.0| 0.0| 0.0|0.6776404124278828|
|0.0| 0.0| 0.0| 0.0| 0.0|
+---+------------------+------------------+-------------------+------------------+
How can I do this in Scala?
Upvotes: 1
Views: 984
Reputation: 28322
To convert the Matrix
to a dataframe as specified, do the following. It first converts the matrix to a dataframe containing a single column with an array. Then foldLeft
is used to break the array into separate columns.
import spark.implicits._
val cols = (0 until lm.numCols).toSeq
val df = lm.transpose
.colIter.toSeq
.map(_.toArray)
.toDF("arr")
val df2 = cols.foldLeft(df)((df, i) => df.withColumn("_" + (i+1), $"arr"(i)))
.drop("arr")
Upvotes: 2