Reputation: 933
I'm trying to correlate a couple columns of a dataframe in spark scala by piping the columns of the original dataframe into the VectorAssembler followed by the Correlation util. For some reason the Vector assembler seems to be producing empty vectors as seen below. Here's what I have so far.
val numericalCols = Array(
"price", "bedrooms", "bathrooms",
"sqft_living", "sqft_lot"
)
val data: DataFrame = HousingDataReader(spark)
data.printSchema()
/*
...
|-- price: decimal(38,18) (nullable = true)
|-- bedrooms: decimal(38,18) (nullable = true)
|-- bathrooms: decimal(38,18) (nullable = true)
|-- sqft_living: decimal(38,18) (nullable = true)
|-- sqft_lot: decimal(38,18) (nullable = true)
...
*/
println("total record:"+data.count()) //total record:21613
val assembler = new VectorAssembler().setInputCols(numericalCols)
.setOutputCol("features").setHandleInvalid("skip")
val df = assembler.transform(data).select("features","price")
df.printSchema()
/*
|-- features: vector (nullable = true)
|-- price: decimal(38,18) (nullable = true)
*/
df.show
/* THIS IS ODD
+--------+-----+
|features|price|
+--------+-----+
+--------+-----+
*/
println("df row count:" + df.count())
// df row count:21613
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head //ERROR HERE
println("Pearson correlation matrix:\n" + coeff1.toString)
this ends up with the following exception
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:73)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:84)
at
Upvotes: 1
Views: 1137
Reputation: 1712
Looks like any one of your feature columns contains a null value always. setHandleInvalid("skip") will skip any row that contains null in one of the features. Can you try filling the null values with fillna(0) and check the result. This must solve your issue.
Upvotes: 0