Reputation: 15345
I have the following simple function where I fill each column's missing values with 0 and then calculate the standard deviation. I know I could use the describe function, but I wanted to use this one for my purpose.
def stdDevAllColumns(df: DataFrame): DataFrame = {
df.select(df.columns.map(c => df.select(c).na.fill(0).agg(stddev(c))): _*)
}
This fails to compile when I run it in the REPL:
<console>:143: error: overloaded method value select with alternatives:
[U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.DataFrame)
df.select(df.columns.map(c => df.select(c).na.fill(0).agg(stddev(c))): _*)
Any suggestions on what this could be pointing to?
EDIT: There is a much simpler way to do this as below:
z.show(df.summary("stddev"))
But nevertheless, I would like to know what the problem is with my implementation above.
Upvotes: 0
Views: 228
Reputation: 1220
The error occurs because the compiler can't find a select overload on DataFrame with a matching signature. Sometimes it helps to break the expression down into smaller parts to see what is happening:
val df: DataFrame = ???
val fn: Array[DataFrame] = df.columns.map(c => df.select(c).na.fill(0).agg(stddev(c)))
def stdDevAllColumns(df: DataFrame): DataFrame = {
df.select(fn: _*) // compiler correctly complains
}
When we extract your select expression into a variable fn, the compiler correctly infers that it has type Array[DataFrame]. There is no select method on DataFrame that takes this type as an input.
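For completeness, here is a minimal sketch of one way to fix the function (not from the original post; names like stddev_$c are my own choice). Instead of building one DataFrame per column, map each column to an aggregate Column and pass them all to a single agg call, which accepts a head Column plus varargs:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.stddev

def stdDevAllColumns(df: DataFrame): DataFrame = {
  // Fill nulls once up front instead of per column
  val filled = df.na.fill(0)
  // One aggregate expression (a Column, not a DataFrame) per input column
  val aggs = df.columns.map(c => stddev(filled(c)).alias(s"stddev_$c"))
  // agg(expr, exprs*) computes every stddev in a single pass
  filled.agg(aggs.head, aggs.tail: _*)
}
```

This returns a single-row DataFrame with one stddev_* column per input column, rather than trying to nest DataFrames inside a select.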
Upvotes: 1