Reputation: 13
I am trying to compute the variance on a GroupedData object in PySpark 2. Looking at http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData, I don't see any built-in functions for computing variance.
Is there an efficient way to compute the variance on a GroupedData object in PySpark 2?
Here is example code of how I would compute the mean, min, and max on a GroupedData object, but I'm not sure how to compute the variance:
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import avg, min, max

spark = SparkSession.builder.getOrCreate()

columns = ['a', 'b']
vals = [('x', 3), ('x', 5), ('y', 1), ('y', 8), ('y', 4), ('z', 5), ('z', 7), ('z', 4), ('z', 9)]
df = spark.createDataFrame(vals, columns)

df.groupBy('a').agg(avg('b'), min('b'), max('b')).show()
The dataframe df looks like:
+---+---+
|  a|  b|
+---+---+
|  x|  3|
|  x|  5|
|  y|  1|
|  y|  8|
|  y|  4|
|  z|  5|
|  z|  7|
|  z|  4|
|  z|  9|
+---+---+
I would like to create a new dataframe similar to the following, showing the variance:
+---+--------+
|  a|   b_var|
+---+--------+
|  x|  1.0000|
|  y|  8.2222|
|  z|  3.6875|
+---+--------+
Upvotes: 0
Views: 2023
Reputation: 215117
The built-in aggregate functions live in the pyspark.sql.functions module. There are two functions, var_pop and var_samp, which compute the population variance and the sample variance respectively; what you need here is var_pop:
import pyspark.sql.functions as F

(df.groupBy("a")
   .agg(
       F.round(F.var_pop("b"), 2).alias("var_pop_b"),
       F.round(F.var_samp("b"), 2).alias("var_samp_b"))
   .show())
+---+---------+----------+
|  a|var_pop_b|var_samp_b|
+---+---------+----------+
|  x|      1.0|       2.0|
|  z|     3.69|      4.92|
|  y|     8.22|     12.33|
+---+---------+----------+
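If you want a result shaped exactly like the dataframe in the question (a single column named b_var, rounded to four decimal places), a minimal sketch using the same var_pop aggregate could look like this; the b_var name is just the alias taken from the question:

import pyspark.sql.functions as F

# Population variance of 'b' per group, rounded to 4 decimals and aliased to b_var
df.groupBy("a").agg(F.round(F.var_pop("b"), 4).alias("b_var")).show()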
Upvotes: 2