HPZ001

Reputation: 21

Pyspark PCA Implementation

I am stuck on a problem where I want to perform PCA on a PySpark DataFrame column. The column is named 'features', and each row holds a SparseVector.

This is the setup:

df - name of the PySpark DataFrame

features - name of the column

Snippet of the RDD:

[Row(features=SparseVector(2, {1: 50.0})),
 Row(features=SparseVector(2, {0: 654.0, 1: 20.0}))]

from pyspark.mllib.linalg.distributed import RowMatrix
i   = RowMatrix(df.select('features').rdd)
ipc = i.computePrincipalComponents(2)

Error Message

Upvotes: 2

Views: 351

Answers (1)

pissall

Reputation: 7399

You are getting an RDD[Row], where each Row is Row(features=SparseVector(2, {1: 50.0})).

RowMatrix needs an RDD[SparseVector], so change the line:

i = RowMatrix(df.select('features').rdd)

to

i = RowMatrix(df.select('features').rdd.map(lambda x: x[0]))

which returns an RDD[SparseVector].

Upvotes: 1
