Reputation: 2814
Using pyspark, I have a Spark 2.2 DataFrame df
with schema: country: String, year: Integer, x: Float
I want the average value of x over years for each country, for countries with AVG(x) > 10.
The following works:
from pyspark.sql.functions import avg

groups = df.groupBy(df.country).agg(avg('x').alias('avg_x'))
groups.filter(groups.avg_x > 10)
But I would rather not have to define the intermediate groups variable.
I have tried:
df.groupBy(df.country).agg(avg('x').alias('avg_x')).filter(df.avg_x > 10)
But this results in: AttributeError: 'DataFrame' object has no attribute 'avg_x'
Upvotes: 1
Views: 3132
Reputation: 35229
Don't use a column bound to a DataFrame (which simply doesn't have avg_x):
from pyspark.sql.functions import avg, col
df.groupBy(df.country).agg(avg('x').alias('avg_x')).filter(col("avg_x") > 10)
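Here col("avg_x") is an unbound column reference: it is resolved against the DataFrame the filter is applied to (the aggregated result), not against df, which is why the alias is visible.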
or
df.groupBy(df.country).agg(avg('x').alias('avg_x')).filter("avg_x > 10")
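If it helps, here is a minimal end-to-end sketch of the chained version; the SparkSession setup and the sample rows are made up for illustration:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, invented for this example
df = spark.createDataFrame(
    [("FR", 2015, 12.0), ("FR", 2016, 14.0), ("US", 2015, 5.0), ("US", 2016, 7.0)],
    ["country", "year", "x"])

# One chained expression, no intermediate variable
df.groupBy(df.country).agg(avg("x").alias("avg_x")).filter(col("avg_x") > 10).show()
# Only FR remains (avg_x = 13.0); US averages 6.0 and is filtered out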
Upvotes: 1