Youshikyou

Reputation: 417

pyspark dataframe api agg() function

I am wondering when to use agg with aggregation functions. Why do I need agg here?

df.agg(mean(col("a"))).alias("b")

Upvotes: 0

Views: 735

Answers (1)

Steven

Reputation: 15258

According to the official doc, agg:

Aggregate on the entire DataFrame without groups (shorthand for df.groupBy.agg()).

Here is an example:

from pyspark.sql import functions as F

df = spark.createDataFrame([("A", 1), ("A", 5), ("C", 6)], ["key", "value"])

df.show()
+---+-----+                                                                     
|key|value|
+---+-----+
|  A|    1|
|  A|    5|
|  C|    6|
+---+-----+

All three of these statements are equivalent: they compute the mean over the whole dataframe.

df.agg(F.mean("value")).show()
+----------+
|avg(value)|
+----------+
|       4.0|
+----------+
df.groupBy().agg(F.mean("value")).show()
+----------+
|avg(value)|
+----------+
|       4.0|
+----------+
df.select(F.mean("value")).show()
+----------+
|avg(value)|
+----------+
|       4.0|
+----------+

If you specify columns in the groupBy, you change the output:

df.groupBy("key").agg(F.mean("value")).show()
+---+----------+                                                                
|key|avg(value)|
+---+----------+
|  C|       6.0|
|  A|       3.0|
+---+----------+

Upvotes: 1
