Andrew Long

Reputation: 933

How is the VectorAssembler used with Spark's Correlation util?

I'm trying to correlate a couple of columns of a DataFrame in Spark Scala by piping the columns of the original DataFrame into the VectorAssembler, followed by the Correlation util. For some reason the VectorAssembler seems to be producing empty vectors, as seen below. Here's what I have so far.

    val numericalCols = Array(
      "price", "bedrooms", "bathrooms", 
      "sqft_living", "sqft_lot"
    )

    val data: DataFrame = HousingDataReader(spark)
    data.printSchema()
    /*
...
 |-- price: decimal(38,18) (nullable = true)
 |-- bedrooms: decimal(38,18) (nullable = true)
 |-- bathrooms: decimal(38,18) (nullable = true)
 |-- sqft_living: decimal(38,18) (nullable = true)
 |-- sqft_lot: decimal(38,18) (nullable = true)
...
     */

    println("total record:"+data.count()) //total record:21613

    val assembler = new VectorAssembler().setInputCols(numericalCols)
         .setOutputCol("features").setHandleInvalid("skip")


    val df = assembler.transform(data).select("features","price")
    df.printSchema()
    /*
 |-- features: vector (nullable = true)
 |-- price: decimal(38,18) (nullable = true)
     */
    df.show
    /*  THIS IS ODD
+--------+-----+
|features|price|
+--------+-----+
+--------+-----+
     */
    println("df row count:" + df.count())
    // df row count:21613
    val Row(coeff1: Matrix) = Correlation.corr(df, "features").head  //ERROR HERE
    
    println("Pearson correlation matrix:\n" + coeff1.toString)

This ends up with the following exception:

java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.

    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
    at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
    at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
    at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
    at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:73)
    at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:84)
    at 

Upvotes: 1

Views: 1137

Answers (1)

Raghu

Reputation: 1712

It looks like at least one of your feature columns always contains a null value. `setHandleInvalid("skip")` drops every row that has a null in any of the input columns, which is why the transformed DataFrame shows no rows. Try filling the null values with 0 (e.g. `na.fill(0)` in Scala, `fillna(0)` in PySpark) before assembling, and check the result. That should solve your issue.
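A minimal sketch of that fix against the question's code, assuming `data` is the DataFrame from `HousingDataReader` and that replacing nulls with 0 is acceptable for your analysis (you could instead drop those rows or impute differently):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val numericalCols = Array(
  "price", "bedrooms", "bathrooms",
  "sqft_living", "sqft_lot"
)

// Replace nulls with 0 in the feature columns so no rows get skipped.
val filled = data.na.fill(0.0, numericalCols)

val assembler = new VectorAssembler()
  .setInputCols(numericalCols)
  .setOutputCol("features")

val df = assembler.transform(filled).select("features", "price")

val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff1.toString)
```

With the nulls filled, `handleInvalid` can be left at its default, and `df.count()` should match the original row count instead of zero.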

Upvotes: 0

Related Questions