Reputation: 11
I've got a table:
Country Emission
England 100
England 200
Germany 150
Germany 170
and I want to calculate average of Emission for every country and then find a country with the biggest average (something like max(avg(Emission))
. I would like to print the Country next to the calculated value, like:
Germany 160
.
I know that I can do it by calculating avg
, sorting by avg
and then print the first row, but sorting seems to be suboptimal. Can I ommit this?
This is the solution with sorting which I wrote:
spdf\
.groupBy("Country")\
.agg(f.avg("Emission").alias("Emission"))\
.orderBy(f.desc("Emission"))\
.limit(1)\
.show()
and this is the query, which returns only max(avg(Emission))
spdf\
.groupBy("Country")\
.avg("Emission")\
.agg(f.max("avg(Emission)").alias("Emission"))\
.show()
Upvotes: 1
Views: 979
Reputation: 42352
You can also use window function instead of group by. The idea is also to construct a struct, get the maximum of that struct and expand its elements.
df2 = df.selectExpr('struct(Country, avg(Emission) over (partition by Country) avg_emission) s') \
.selectExpr('max(s) s') \
.select('s.*')
df2.show()
+-------+------------+
|Country|avg_emission|
+-------+------------+
|Germany| 160.0|
+-------+------------+
Upvotes: 1
Reputation: 10406
It's true that it is a shame to sort something just to get the maximum value. Therefore, what you can do is use the max
function. Since you want to keep the name of the country and not just the maximum value of Emission, you can bind the two column together in a struct
. The code would look like this:
from pyspark.sql import functions as F
data=[("England", 100), ("England", 200), ("Germany", 150), ("Germany", 170)]
df = spark.createDataFrame(data, ["Country", "Emission"])
df.groupBy("Country")\
.agg(F.avg("Emission").alias("Emission"))\
.select(F.max(F.struct("Emission", "Country")).alias("s"))\
.select("s.*")\
.show()
+--------+-------+
|Emission|Country|
+--------+-------+
| 160.0|Germany|
+--------+-------+
The first select
selects the country with maximum Emission. The result is wrapped within a struct named "s"
. The second select unwraps the result.
Upvotes: 1