I have built a PySpark DataFrame using:
data = sqlContext.read.load('data.csv', format='com.databricks.spark.csv', delimiter=',', header='true', inferSchema='true')
I want to perform PCA on this DataFrame. Its schema is:
DataFrame[col0: double, col1: double, col2: double, col3: double, col4: double]
| col0| col1| col2| col3| col4|
| -8.801490628| -1.68848604044| 6.29108688718| 1.68614762629| -2.78418041902|
| 6.99040350558| -2.79455708195| -5.57115314522| 4.22337477957|-0.366589003047|
| 6.8950808389| 7.65514024658| 8.0214838208| -5.12100927058| 3.17467779733|
| 6.74150161414| 1.19627062139| 0.821181991602| 5.12589137044| -3.86248588187|
| 9.15545404244| 7.80553468656| -8.1232517076| 2.6242726214| -7.59049824307|
| -6.014643738|-0.470165781449|-0.226389435704| -2.55837378209| -2.06405566854|
| -9.49629160445| -9.85331556717| -7.44474566663| 6.48359295657| 9.75680835864|
| 0.450876020546| -3.55454445478| -2.82100689682| 5.15104966779| -7.70810268078|
| -7.21960567005| 0.102168086158| -1.46779736909| -3.87897074493| -3.17592118456|
| -8.75820987524| -8.63519048007| -4.20447284625|-0.394878764685| -5.79070138764|
| 9.47825273869| 6.02827892008| -9.7181540689| -9.0341215112| 5.96203870171|
| -1.56616611175| 1.64353225245| 9.20883287312|-0.158689954569| 4.92646032432|
|-0.952144934546| -2.9114138684| 2.99204980215| -4.64479019591| -5.99952901402|
| 3.55670956201|-0.812146671595| -1.81243042667| -1.0765836636| 4.9669633757|
| -2.28427448245| 0.982018554172| 2.2453332695| 1.02432988704| -7.42272905399|
| 5.5901346625| 9.7266134961| 0.372411854139| 4.62762920616| -7.39599025974|
| 9.54828822231| -2.99982461624| 2.17542923571| 6.98459564802| 4.17077742377|
| -6.93309333389| 6.54244346903| 0.783827506295| 4.51631424946| 5.14605443379|
| -1.39844067044| 5.94842772889| 0.270728638304| 4.71245951003| 7.60767471606|
| -7.45885401935| -2.17059549479| 9.13976371571| -7.59189334493| -2.3924001937|
To do that I have to work with PCA, so this is how I am doing it:
dataPCA = PCA(k=2, inputCol=str(data.columns), outputCol="pcaFeatures")
model = dataPCA.fit(data)
I am getting this error:
pyspark.sql.utils.IllegalArgumentException: u'Field "[\'col0\', \'col1\', \'col2\', \'col3\', \'col4\']" does not exist.
What's wrong, and how can I fix it?
As mentioned by mkaran, PCA requires a Vector column as input. You have to assemble your data first, for example using VectorAssembler or RFormula. Please follow the examples in "Encode and assemble multiple features in PySpark" for details.
from pyspark.ml.feature import RFormula
data = RFormula(formula=" ~ {0}".format(" + ".join(data.columns))).fit(data).transform(data)