Reputation: 3753
I have a registered table in pyspark.
+--------+-------+--------+------------+---------+-----------------+----------------------+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|days_since_prior_order|
+--------+-------+--------+------------+---------+-----------------+----------------------+
| 2168274| 2| prior| 1| 2| 11| null|
| 1501582| 2| prior| 2| 5| 10| 10|
| 1901567| 2| prior| 3| 1| 10| 3|
| 738281| 2| prior| 4| 2| 10| 8|
| 1673511| 2| prior| 5| 3| 11| 8|
| 1199898| 2| prior| 6| 2| 9| 13|
| 3194192| 2| prior| 7| 2| 12| 14|
| 788338| 2| prior| 8| 1| 15| 27|
| 1718559| 2| prior| 9| 2| 9| 8|
| 1447487| 2| prior| 10| 1| 11| 6|
| 1402090| 2| prior| 11| 1| 10| 30|
| 3186735| 2| prior| 12| 1| 9| 28|
| 3268552| 2| prior| 13| 4| 11| 30|
| 839880| 2| prior| 14| 3| 10| 13|
| 1492625| 2| train| 15| 1| 11| 30|
+--------+-------+--------+------------+---------+-----------------+----------------------+
I want to calculate the variance of days_since_prior_order
, excluding the null value. The right value should be 97.91836734693878, which is given by hive and python. But my pyspark give me 105.45054945054943.
spark.sql("select variance(days_since_prior_order) from \
(select * from orders where user_id=2 and days_since_prior_order is not null ) ").show()
The original table data types are correct.
|-- order_id: long (nullable = true)
|-- user_id: long (nullable = true)
|-- eval_set: string (nullable = true)
|-- order_number: short (nullable = true)
|-- order_dow: short (nullable = true)
|-- order_hour_of_day: short (nullable = true)
|-- days_since_prior_order: short (nullable = true)
Upvotes: 0
Views: 1506
Reputation: 2108
Try to use the following function instead of pyspark.sql.functions.variance(col):
pyspark.sql.functions.var_pop(col)
Aggregate function: returns the population variance of the values in a group.
With your column data, var_pop gives me this result:
[Row(var_pop(days_since_prior_order)=97.91836734693877)]
The reason is that:
with N number of values selected.
See population and sample variance for a useful link.
Here you will find the docs of var_pop()
Upvotes: 2