thetna

Reputation: 7143

Pyspark: Add the average as a new column to DataFrame

I am computing the mean of a column in a DataFrame, but it resulted in all zeros. Can someone help me understand why this is happening? Below are the code and the table before and after the transformation.

Before computing the mean and adding the "mean" column:

result.select("dis_price_released").show(10)
+------------------+
|dis_price_released|
+------------------+
|               0.0|
|               4.0|
|               4.0|
|               4.0|
|               1.0|
|               4.0|
|               4.0|
|               0.0|
|               4.0|
|               0.0|
+------------------+

After computing the mean and adding the "mean" column:

import sys
from pyspark.sql import Window
from pyspark.sql.functions import avg
w = Window().partitionBy("dis_price_released").rowsBetween(-sys.maxsize, sys.maxsize)
df2 = result.withColumn("mean", avg("dis_price_released").over(w))
df2.select("dis_price_released", "mean").show(10)

+------------------+----+
|dis_price_released|mean|
+------------------+----+
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
+------------------+----+

Upvotes: 2

Views: 11516

Answers (3)

Justin

Reputation: 211

The issue is that if you want the average of a column across all rows, you should not partition by any column at all: partitioning by dis_price_released puts rows with identical values into the same window, so the mean of each window is simply the value itself (and show(10) happens to display only rows from the 0.0 partition). You also do not need rowsBetween in this case. Thus (assuming the result DataFrame exists) your code should read:

from pyspark.sql import Window
from pyspark.sql.functions import avg

w = Window().partitionBy()
df2 = result.withColumn("mean", avg("dis_price_released").over(w))
df2.select("dis_price_released", "mean").show(10)
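
Calling partitionBy() with no arguments puts every row into a single window, so avg is computed over the entire DataFrame.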

Upvotes: 0

Ramesh Maharjan

Reputation: 41987

This is yet another way to solve the problem:

from pyspark.sql.functions import avg, lit
df.withColumn("mean", lit(df.select(avg("dis_price_released").alias("temp")).first()["temp"])).show()
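
The lit() call is needed because first()["temp"] returns a plain Python float, and lit() wraps it into a constant Column that can be attached to every row.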

Upvotes: 0

mtoto

Reputation: 24198

You can compute the average of the whole column first, then use lit() to add it as a constant to your DataFrame; there is no need for window functions here:

from pyspark.sql.functions import lit

mean = df.groupBy().avg("dis_price_released").take(1)[0][0]
df.withColumn("test", lit(mean)).show()
+------------------+----+
|dis_price_released|test|
+------------------+----+
|               0.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               1.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               0.0| 2.5|
|               4.0| 2.5|
|               0.0| 2.5|
+------------------+----+
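
If you would rather avoid collecting the aggregate to the driver, here is a sketch of the same idea using a cross join (assuming Spark 2.1+, where DataFrame.crossJoin is available):

from pyspark.sql.functions import avg

# One-row DataFrame holding the overall average
mean_df = df.agg(avg("dis_price_released").alias("mean"))

# Cross join attaches that single aggregate row to every row of df
df.crossJoin(mean_df).show()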

Upvotes: 2
