user5492457

Perform PCA on a PySpark dataframe

I have built a PySpark dataframe using:

data = sqlContext.read.load('data.csv', format='com.databricks.spark.csv', delimiter=',', header='true', inferSchema='true')

I want to perform PCA on this dataframe. Its schema is:

>>> data
DataFrame[col0: double, col1: double, col2: double, col3: double, col4: double]

>>> data.show()
+---------------+---------------+---------------+---------------+---------------+
|           col0|           col1|           col2|           col3|           col4|
+---------------+---------------+---------------+---------------+---------------+
|   -8.801490628| -1.68848604044|  6.29108688718|  1.68614762629| -2.78418041902|
|  6.99040350558| -2.79455708195| -5.57115314522|  4.22337477957|-0.366589003047|
|   6.8950808389|  7.65514024658|   8.0214838208| -5.12100927058|  3.17467779733|
|  6.74150161414|  1.19627062139| 0.821181991602|  5.12589137044| -3.86248588187|
|  9.15545404244|  7.80553468656|  -8.1232517076|   2.6242726214| -7.59049824307|
|   -6.014643738|-0.470165781449|-0.226389435704| -2.55837378209| -2.06405566854|
| -9.49629160445| -9.85331556717| -7.44474566663|  6.48359295657|  9.75680835864|
| 0.450876020546| -3.55454445478| -2.82100689682|  5.15104966779| -7.70810268078|
| -7.21960567005| 0.102168086158| -1.46779736909| -3.87897074493| -3.17592118456|
| -8.75820987524| -8.63519048007| -4.20447284625|-0.394878764685| -5.79070138764|
|  9.47825273869|  6.02827892008|  -9.7181540689|  -9.0341215112|  5.96203870171|
| -1.56616611175|  1.64353225245|  9.20883287312|-0.158689954569|  4.92646032432|
|-0.952144934546|  -2.9114138684|  2.99204980215| -4.64479019591| -5.99952901402|
|  3.55670956201|-0.812146671595| -1.81243042667|  -1.0765836636|   4.9669633757|
| -2.28427448245| 0.982018554172|   2.2453332695|  1.02432988704| -7.42272905399|
|   5.5901346625|   9.7266134961| 0.372411854139|  4.62762920616| -7.39599025974|
|  9.54828822231| -2.99982461624|  2.17542923571|  6.98459564802|  4.17077742377|
| -6.93309333389|  6.54244346903| 0.783827506295|  4.51631424946|  5.14605443379|
| -1.39844067044|  5.94842772889| 0.270728638304|  4.71245951003|  7.60767471606|
| -7.45885401935| -2.17059549479|  9.13976371571| -7.59189334493|  -2.3924001937|
+---------------+---------------+---------------+---------------+---------------+

To do that I have to work with pyspark.ml.feature, so this is how I am doing it:

from pyspark.ml.feature import PCA

dataPCA = PCA(k=2, inputCol=str(data.columns), outputCol="pcaFeatures")
model = dataPCA.fit(data)

and I am getting this error:

pyspark.sql.utils.IllegalArgumentException: u'Field "[\'col0\', \'col1\', \'col2\', \'col3\', \'col4\']" does not exist.

What's wrong, and how do I fix it?

Upvotes: 1

Views: 3905

Answers (1)

user8944954

Reputation: 21

As mentioned by mkaran, PCA requires a Vector column as input. Passing str(data.columns) makes Spark look for a single column literally named "['col0', 'col1', 'col2', 'col3', 'col4']", which is why you get the "Field ... does not exist" error. You have to assemble your data first, for example using VectorAssembler or RFormula.

Please follow the examples in Encode and assemble multiple features in PySpark for details.

from pyspark.ml.feature import RFormula

# Assemble every column into a single Vector column (RFormula outputs "features" by default)
data = RFormula(formula=" ~ {0}".format(" + ".join(data.columns))).fit(data).transform(data)

# Point the PCA stage at the assembled column and apply it
dataPCA.setInputCol("features").fit(data).transform(data)
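
If you prefer to keep the original columns untouched, a VectorAssembler-based sketch could look like the following (the column names "features" and "pcaFeatures" here are just illustrative choices):

from pyspark.ml.feature import PCA, VectorAssembler

# Pack all numeric columns into one Vector column
assembler = VectorAssembler(inputCols=data.columns, outputCol="features")
assembled = assembler.transform(data)

# Project onto the first two principal components
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
result = pca.fit(assembled).transform(assembled)
result.select("pcaFeatures").show(truncate=False)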

Upvotes: 2
