Find maximum average using Dataframe API in spark

I've got a table:

Country Emission
England 100
England 200
Germany 150
Germany 170

and I want to calculate average of Emission for every country and then find a country with the biggest average (something like max(avg(Emission)). I would like to print the Country next to the calculated value, like: Germany 160.

I know that I can do it by calculating avg, sorting by avg and then print the first row, but sorting seems to be suboptimal. Can I ommit this?

This is the solution with sorting which I wrote:

spdf\
    .groupBy("Country")\
    .agg(f.avg("Emission").alias("Emission"))\
    .orderBy(f.desc("Emission"))\
    .limit(1)\
    .show()

and this is the query, which returns only max(avg(Emission))

spdf\
    .groupBy("Country")\
    .avg("Emission")\
    .agg(f.max("avg(Emission)").alias("Emission"))\
    .show()

Upvotes: 1

Answers (2)

mck

Reputation: 42352

You can also use window function instead of group by. The idea is also to construct a struct, get the maximum of that struct and expand its elements.

df2 = df.selectExpr('struct(Country, avg(Emission) over (partition by Country) avg_emission) s') \
        .selectExpr('max(s) s') \
        .select('s.*')

df2.show()
+-------+------------+
|Country|avg_emission|
+-------+------------+
|Germany|       160.0|
+-------+------------+

Upvotes: 1

Oli

Reputation: 10406

It's true that it is a shame to sort something just to get the maximum value. Therefore, what you can do is use the max function. Since you want to keep the name of the country and not just the maximum value of Emission, you can bind the two column together in a struct. The code would look like this:

from pyspark.sql import functions as F
data=[("England", 100), ("England", 200), ("Germany", 150), ("Germany", 170)]
df = spark.createDataFrame(data, ["Country", "Emission"])

df.groupBy("Country")\
  .agg(F.avg("Emission").alias("Emission"))\
  .select(F.max(F.struct("Emission", "Country")).alias("s"))\
  .select("s.*")\
  .show()

+--------+-------+
|Emission|Country|
+--------+-------+
|   160.0|Germany|
+--------+-------+

The first select selects the country with maximum Emission. The result is wrapped within a struct named "s". The second select unwraps the result.

Upvotes: 1

Find maximum average using Dataframe API in spark

Answers (2)

Related Questions