Error Calculating Standard Deviation on a Spark DataFrame

Question

I have the following simple function, which fills the missing values in every column with 0 and then calculates the standard deviation. I know I could use the describe function, but I want to use this one for my purposes.

def stdDevAllColumns(df: DataFrame): DataFrame = {
  df.select(df.columns.map(c => df.select(c).na.fill(0).agg(stddev(c))): _*)
}

This does not compile. Evaluating it produces the following error:

:143: error: overloaded method value select with alternatives:
  [U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] 
  (col: String,cols: String*)org.apache.spark.sql.DataFrame 
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.sql.DataFrame)
         df.select(df.columns.map(c => df.select(c).na.fill(0).agg(stddev(c))): _*)

Any suggestions on what this error is pointing to?

EDIT: There is a much simpler way to do this:

z.show(df.summary("stddev"))

Nevertheless, I would like to know what is wrong with my implementation above.

Jarrod Baker · Accepted Answer

The error occurs because the Scala compiler can't find a select method on DataFrame whose signature matches the argument you pass. It often helps to break the expression into smaller parts to see what is happening:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.stddev

val df: DataFrame = ???
val fn: Array[DataFrame] = df.columns.map(c => df.select(c).na.fill(0).agg(stddev(c)))
def stdDevAllColumns(df: DataFrame): DataFrame = {
  df.select(fn: _*) // compiler correctly complains
}

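When we extract your select expression into a variable fn, the compiler correctly infers that it has type Array[DataFrame]: each df.select(c).na.fill(0).agg(stddev(c)) is itself a DataFrame, not a Column. There is no select method on DataFrame that takes this type as input; the available overloads accept Column, String, or TypedColumn arguments.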
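One possible way around the type error, offered as a sketch rather than the only fix, is to fill the missing values once on the whole DataFrame and then pass select an Array[Column] of aggregate expressions. This assumes every column is numeric and that no column name contains a dot (col would treat it as a nested field reference); the alias is only there to keep the original column names.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, stddev}

// Sketch, assuming all columns are numeric.
def stdDevAllColumns(df: DataFrame): DataFrame = {
  // One aggregate Column per input column: Array[Column], not Array[DataFrame],
  // so it matches the select(cols: Column*) overload.
  val aggregates: Array[Column] =
    df.columns.map(c => stddev(col(c)).alias(c))

  // Replace nulls in numeric columns with 0 first, then aggregate.
  // Selecting only aggregate expressions yields a single-row DataFrame.
  df.na.fill(0).select(aggregates: _*)
}
```

Note that stddev is the sample standard deviation (the same as stddev_samp), and that filling nulls with 0 changes the result compared with ignoring them, so whether this is the right preprocessing depends on your data.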
