Andrew Long

Reputation: 933

How is the VectorAssembler used with Spark's Correlation util?

I'm trying to correlate a couple of columns of a DataFrame in Spark Scala by piping the columns of the original DataFrame into the VectorAssembler, followed by the Correlation util. For some reason the VectorAssembler seems to be producing empty vectors, as seen below. Here's what I have so far.

    val numericalCols = Array(
      "price", "bedrooms", "bathrooms", 
      "sqft_living", "sqft_lot"
    )

    val data: DataFrame = HousingDataReader(spark)
    data.printSchema()
    /*
...
 |-- price: decimal(38,18) (nullable = true)
 |-- bedrooms: decimal(38,18) (nullable = true)
 |-- bathrooms: decimal(38,18) (nullable = true)
 |-- sqft_living: decimal(38,18) (nullable = true)
 |-- sqft_lot: decimal(38,18) (nullable = true)
...
     */

    println("total record:"+data.count()) //total record:21613

    val assembler = new VectorAssembler().setInputCols(numericalCols)
         .setOutputCol("features").setHandleInvalid("skip")


    val df = assembler.transform(data).select("features","price")
    df.printSchema()
    /*
 |-- features: vector (nullable = true)
 |-- price: decimal(38,18) (nullable = true)
     */
    df.show
    /*  THIS IS ODD
+--------+-----+
|features|price|
+--------+-----+
+--------+-----+
     */
    println("df row count:" + df.count())
    // df row count:21613
    val Row(coeff1: Matrix) = Correlation.corr(df, "features").head  //ERROR HERE
    
    println("Pearson correlation matrix:\n" + coeff1.toString)

This ends up with the following exception:

java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.

    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
    at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
    at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
    at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
    at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:73)
    at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:84)
    at 

Upvotes: 1

Views: 1137

Answers (1)

Raghu

Reputation: 1712

It looks like at least one of your feature columns always contains a null value. `setHandleInvalid("skip")` drops every row that has a null in any of the input columns, which is why the transformed DataFrame shows no rows. Try filling the null values with 0 (e.g. `na.fill(0)` in Scala, `fillna(0)` in PySpark) before assembling, and check the result. That should solve your issue.
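A minimal sketch of that fix against the question's code, assuming `data` is the DataFrame from `HousingDataReader` and that replacing nulls with 0 is acceptable for your analysis (you could instead drop those rows or impute differently):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val numericalCols = Array(
  "price", "bedrooms", "bathrooms",
  "sqft_living", "sqft_lot"
)

// Replace nulls with 0 in the feature columns so no rows get skipped.
val filled = data.na.fill(0.0, numericalCols)

val assembler = new VectorAssembler()
  .setInputCols(numericalCols)
  .setOutputCol("features")

val df = assembler.transform(filled).select("features", "price")

val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff1.toString)
```

With the nulls filled, `handleInvalid` can be left at its default, and `df.count()` should match the original row count instead of zero.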

Upvotes: 0

Related Questions