M. Page

Reputation: 2814

Spark SQL DataFrame HAVING

Using pyspark, I have a Spark 2.2 DataFrame df with schema country: String, year: Integer, x: Float. I want the average value of x over years for each country, keeping only the countries where AVG(x) > 10. The following works:

from pyspark.sql.functions import avg

groups = df.groupBy(df.country).agg(avg('x').alias('avg_x'))
groups.filter(groups.avg_x > 10)

But I would rather not have to define the intermediate groups variable just to filter on the aggregate.

I have tried:

df.groupBy(df.country).agg(avg('x').alias('avg_x')).filter(df.avg_x > 10)

But this results in: AttributeError: 'DataFrame' object has no attribute 'avg_x'

Upvotes: 1

Views: 3132

Answers (1)

Alper t. Turker

Reputation: 35229

Don't reference the column through a DataFrame that doesn't have it (df simply has no avg_x column). Use a free-standing column expression instead:

from pyspark.sql.functions import avg, col

df.groupBy(df.country).agg(avg('x').alias('avg_x')).filter(col("avg_x") > 10)

or

df.groupBy(df.country).agg(avg('x').alias('avg_x')).filter("avg_x > 10")

Upvotes: 1
