Reputation: 231
I'm looking for a way to calculate some statistic, e.g. a mean, over several selected columns in Spark using Scala. Given that the data object is my Spark DataFrame, it's easy to calculate a mean for one column only, e.g.:
data.agg(avg("var1") as "mean var1").show
Also, we can easily calculate a mean cross-tabulated by the values of some other column, e.g.:
data.groupBy("category").agg(avg("var1") as "mean_var1").show
But how can we calculate a mean for a List of columns in a DataFrame? I tried running something like this, but it didn't work:
scala> data.select("var1", "var2").mean().show
<console>:44: error: value mean is not a member of org.apache.spark.sql.DataFrame
data.select("var1", "var2").mean().show
^
Upvotes: 2
Views: 7516
Reputation: 825
If you already have the Dataset, you can do this:
ds.describe("age").show()
which will print this:
+-------+----+
|summary| age|
+-------+----+
|  count|10.0|
|   mean|53.3|
| stddev|11.6|
|    min|18.0|
|    max|92.0|
+-------+----+
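As a side note, describe also accepts several column names at once, so the same call covers multiple columns; here "salary" is just an assumed second numeric column for illustration:
ds.describe("age", "salary").show()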
Upvotes: 1
Reputation: 23099
This is what you need to do:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq((1, 2, 3), (3, 4, 5), (1, 2, 4)).toDF("A", "B", "C")

// build one avg(...) expression per column and splice them into select
df1.select(df1.columns.map(mean(_)): _*).show()
Output:
+------------------+------------------+------+
| avg(A)| avg(B)|avg(C)|
+------------------+------------------+------+
|1.6666666666666667|2.6666666666666665| 4.0|
+------------------+------------------+------+
This also works for selected columns only:
df1.select(Seq("A", "B").map(mean(_)): _*).show()
Output:
+------------------+------------------+
| avg(A)| avg(B)|
+------------------+------------------+
|1.6666666666666667|2.6666666666666665|
+------------------+------------------+
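To answer the List part of the question directly, the same map-and-splice pattern works when the column names arrive as a runtime List; the alias call below is only a cosmetic assumption to keep the output headers readable:
val cols = List("A", "B")
// one aliased avg(...) expression per requested column
df1.select(cols.map(c => mean(col(c)).alias(s"mean_$c")): _*).show()
As an aside, the attempt in the question was close: mean is not a member of DataFrame, but it is defined on the grouped dataset returned by groupBy(), so data.select("var1", "var2").groupBy().mean().show() should work as well.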
Hope this helps!
Upvotes: 7