simtim

Reputation: 231

Calculate mean for several columns in Spark scala

I'm looking for a way to calculate a statistic, e.g. the mean, over several selected columns in Spark using Scala. Given that data is my Spark DataFrame, it's easy to calculate the mean for a single column, e.g.

data.agg(avg("var1") as "mean var1").show

We can also easily calculate a mean cross-tabulated by the values of some other column, e.g.:

data.groupBy("category").agg(avg("var1") as "mean_var1").show

But how can we calculate a mean for a List of columns in a DataFrame? I tried running something like this, but it didn't work:

scala> data.select("var1", "var2").mean().show
<console>:44: error: value mean is not a member of org.apache.spark.sql.DataFrame
       data.select("var1", "var2").mean().show
                                   ^

Upvotes: 2

Views: 7516

Answers (2)

David Urry

Reputation: 825

If you already have the dataset, you can do this:

ds.describe(s"age")

Which will return this:

    summary age  
    count   10.0 
    mean    53.3   
    stddev  11.6
    min     18.0
    max     92.0
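
Since describe also accepts a varargs list of column names, the same call covers the multi-column case from the question. A minimal sketch, assuming the DataFrame has the var1 and var2 columns mentioned there:

ds.describe("var1", "var2").show()

Each statistic (count, mean, stddev, min, max) is reported per column, with the mean in the second row.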

Upvotes: 1

koiralo

Reputation: 23099

This is what you need to do:

import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq((1, 2, 3), (3, 4, 5), (1, 2, 4)).toDF("A", "B", "C")

// compute the mean of every column at once
df1.select(df1.columns.map(mean(_)): _*).show()

Output:

+------------------+------------------+------+
|            avg(A)|            avg(B)|avg(C)|
+------------------+------------------+------+
|1.6666666666666667|2.6666666666666665|   4.0|
+------------------+------------------+------+

This works for a subset of columns as well:

df1.select(Seq("A", "B").map(mean(_)): _*).show()

Output:

+------------------+------------------+
|            avg(A)|            avg(B)|
+------------------+------------------+
|1.6666666666666667|2.6666666666666665|
+------------------+------------------+
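
If you also want the cross-tabulated means from the question, the same list of columns can be turned into aggregate expressions and passed to agg after a groupBy. A minimal sketch against the df1 defined above; grouping by A is an arbitrary choice for illustration:

val cols = Seq("B", "C")
// one mean expression per column, aliased for readable output
val aggExprs = cols.map(c => mean(c).as(s"mean_$c"))
df1.groupBy("A").agg(aggExprs.head, aggExprs.tail: _*).show()

agg takes a first Column plus varargs, hence the head/tail split.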

Hope this helps!

Upvotes: 7
