yanachen

Reputation: 3753

Why does PySpark give the wrong value of variance?

I have a registered table in PySpark.

+--------+-------+--------+------------+---------+-----------------+----------------------+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|days_since_prior_order|
+--------+-------+--------+------------+---------+-----------------+----------------------+
| 2168274|      2|   prior|           1|        2|               11|                  null|
| 1501582|      2|   prior|           2|        5|               10|                    10|
| 1901567|      2|   prior|           3|        1|               10|                     3|
|  738281|      2|   prior|           4|        2|               10|                     8|
| 1673511|      2|   prior|           5|        3|               11|                     8|
| 1199898|      2|   prior|           6|        2|                9|                    13|
| 3194192|      2|   prior|           7|        2|               12|                    14|
|  788338|      2|   prior|           8|        1|               15|                    27|
| 1718559|      2|   prior|           9|        2|                9|                     8|
| 1447487|      2|   prior|          10|        1|               11|                     6|
| 1402090|      2|   prior|          11|        1|               10|                    30|
| 3186735|      2|   prior|          12|        1|                9|                    28|
| 3268552|      2|   prior|          13|        4|               11|                    30|
|  839880|      2|   prior|          14|        3|               10|                    13|
| 1492625|      2|   train|          15|        1|               11|                    30|
+--------+-------+--------+------------+---------+-----------------+----------------------+

I want to calculate the variance of days_since_prior_order, excluding the null value. The correct value should be 97.91836734693878, which is what Hive and Python give. But my PySpark query gives me 105.45054945054943.

spark.sql("select variance(days_since_prior_order) from \
(select * from orders where user_id=2 and days_since_prior_order is not null ) ").show()
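For reference, the expected value can be reproduced outside Spark. This is a minimal check, assuming NumPy as the "Python" baseline and using the 14 non-null values printed in the sample above:

    import numpy as np

    # The 14 non-null days_since_prior_order values from the sample shown above
    x = np.array([10, 3, 8, 8, 13, 14, 27, 8, 6, 30, 28, 30, 13, 30])

    print(np.var(x))  # 97.91836734693878, matching the Hive result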

The data types in the original table are correct:

 |-- order_id: long (nullable = true)
 |-- user_id: long (nullable = true)
 |-- eval_set: string (nullable = true)
 |-- order_number: short (nullable = true)
 |-- order_dow: short (nullable = true)
 |-- order_hour_of_day: short (nullable = true)
 |-- days_since_prior_order: short (nullable = true)

Upvotes: 0

Views: 1506

Answers (1)

titiro89

Reputation: 2108

Try using the following function instead of pyspark.sql.functions.variance(col):

pyspark.sql.functions.var_pop(col)

Aggregate function: returns the population variance of the values in a group.

With your column data, var_pop gives me this result:

[Row(var_pop(days_since_prior_order)=97.91836734693877)]

The reason is that:

  • variance() and var_samp() are scaled by 1/(N-1)
  • var_pop() is scaled by 1/N

where N is the number of selected values.
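Concretely, both flavours can be computed side by side. This is a minimal sketch, assuming a SparkSession named spark and the registered orders table from your question:

    from pyspark.sql import functions as F

    df = spark.table("orders").where(
        (F.col("user_id") == 2) & F.col("days_since_prior_order").isNotNull()
    )

    # DataFrame API: population vs. sample variance
    df.select(
        F.var_pop("days_since_prior_order").alias("var_pop"),    # ~97.92
        F.var_samp("days_since_prior_order").alias("var_samp"),  # ~105.45
    ).show()

    # Equivalent SQL
    spark.sql("""
        SELECT var_pop(days_since_prior_order)  AS var_pop,
               var_samp(days_since_prior_order) AS var_samp
        FROM orders
        WHERE user_id = 2 AND days_since_prior_order IS NOT NULL
    """).show()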

See this useful link on population and sample variance.

Here you will find the documentation of var_pop().

Upvotes: 2
