Reputation: 21
I am stuck on a problem where I want to run PCA on a PySpark DataFrame column. The column is named 'features', and each row is a SparseVector.
This is the flow:
df - name of the PySpark DataFrame
features - name of the column
Snippet of the RDD:
[Row(features=SparseVector(2, {1: 50.0})),
 Row(features=SparseVector(2, {0: 654.0, 1: 20.0}))]
from pyspark.mllib.linalg.distributed import RowMatrix
i = RowMatrix(df.select('features').rdd)
ipc = i.computePrincipalComponents(2)
Upvotes: 2
Reputation: 7399
You are getting an RDD[Row] object, where each Row is Row(features=SparseVector(2, {1: 50.0})). You need an RDD[SparseVector], so you should change your line:
i = RowMatrix(df.select('features').rdd)
to
i = RowMatrix(df.select('features').rdd.map(lambda x: x[0]))
The map call unwraps each Row, so RowMatrix receives an RDD[SparseVector].
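For completeness, here is a minimal sketch of the whole flow with the Row unwrapping applied. The helper names unwrap_first_field and principal_components are hypothetical, made up for this illustration. The sketch also assumes that the 'features' column might hold pyspark.ml vectors (for example, if it came from a pyspark.ml transformer). In that case the vectors probably need converting with Vectors.fromML before RowMatrix will accept them. If your column already holds pyspark.mllib vectors, the conversion is a no-op.

```python
def unwrap_first_field(row):
    # A Row behaves like a tuple, so row[0] is the value of its first field.
    return row[0]


def principal_components(df, k, column="features"):
    # Imports live inside the function so the helper above can be used
    # without a Spark installation.
    from pyspark.ml.linalg import Vector as MLVector
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    def to_mllib(vec):
        # Assumption: the column may hold pyspark.ml vectors. RowMatrix
        # expects pyspark.mllib vectors, so convert only when necessary.
        return Vectors.fromML(vec) if isinstance(vec, MLVector) else vec

    # RDD[Row] -> RDD[vector] -> RowMatrix
    rows = df.select(column).rdd.map(unwrap_first_field).map(to_mllib)
    return RowMatrix(rows).computePrincipalComponents(k)
```

With the df from the question, usage would be ipc = principal_components(df, 2), which should give the same local matrix that i.computePrincipalComponents(2) returns once i is built from unwrapped vectors.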
Upvotes: 1