HPZ001

Reputation: 21

Pyspark PCA Implementation

I am stuck on a problem where I want to perform PCA on a PySpark DataFrame column. The column is named 'features', and each row holds a SparseVector.

This is the setup:

df - name of the PySpark DataFrame

features - name of the column

Snippet of the RDD:

[Row(features=SparseVector(2, {1: 50.0})),
 Row(features=SparseVector(2, {0: 654.0, 1: 20.0}))]

from pyspark.mllib.linalg.distributed import RowMatrix
i   = RowMatrix(df.select('features').rdd)
ipc = i.computePrincipalComponents(2)

Error Message

Upvotes: 2

Views: 351

Answers (1)

pissall

Reputation: 7399

You are getting an RDD[Row], where each Row is Row(features=SparseVector(2, {1: 50.0})).

RowMatrix needs an RDD[SparseVector], so change the line:

i = RowMatrix(df.select('features').rdd)

to

i = RowMatrix(df.select('features').rdd.map(lambda x: x[0]))

which returns an RDD[SparseVector].

Upvotes: 1
