Reputation: 417
I am wondering when to use agg with the aggregation functions. Why do I need to use agg here?
df.agg(mean(col("a"))).alias("b")
Upvotes: 0
Views: 735
Reputation: 15258
According to the official doc, agg:
Aggregate on the entire DataFrame without groups (shorthand for
df.groupBy().agg()
).
Here is an example:
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 5), ("C", 6)], ["key", "value"])
df.show()
+---+-----+
|key|value|
+---+-----+
| A| 1|
| A| 5|
| C| 6|
+---+-----+
The following three statements are equivalent: each computes the mean over the whole DataFrame.
df.agg(F.mean("value")).show()
+----------+
|avg(value)|
+----------+
| 4.0|
+----------+
df.groupBy().agg(F.mean("value")).show()
+----------+
|avg(value)|
+----------+
| 4.0|
+----------+
df.select(F.mean("value")).show()
+----------+
|avg(value)|
+----------+
| 4.0|
+----------+
If you specify columns in the groupBy
, you change the output:
df.groupBy("key").agg(F.mean("value")).show()
+---+----------+
|key|avg(value)|
+---+----------+
| C| 6.0|
| A| 3.0|
+---+----------+
Upvotes: 1