Josh Ortega

Reputation: 179

Aggregating two columns with PySpark

Learning Apache Spark through PySpark and having issues.

I have the following DF:

+----------+------------+-----------+----------------+
|   game_id|posteam_type|total_plays|total_touchdowns|
+----------+------------+-----------+----------------+
|2009092003|        home|         90|               3|
|2010091912|        home|         95|               0|
|2010112106|        home|         75|               0|
|2010121213|        home|         85|               3|
|2009092011|        null|          9|            null|
|2010110703|        null|          2|            null|
|2010112111|        null|          6|            null|
|2011100909|        home|        102|               3|
|2011120800|        home|         72|               2|
|2012010110|        home|         74|               6|
|2012110410|        home|         68|               1|
|2012120911|        away|         91|               2|
|2011103008|        null|          6|            null|
|2012111100|        null|          3|            null|
|2013092212|        home|         86|               6|
|2013112407|        home|         73|               4|
|2013120106|        home|         99|               3|
|2014090705|        home|         94|               3|
|2014101203|        home|         77|               4|
|2014102611|        home|        107|               6|
+----------+------------+-----------+----------------+

I'm attempting to find the average number of plays it takes to score a touchdown, i.e. sum(total_plays)/sum(total_touchdowns).

I figured out the code to get the sums but can't figure out how to get the total average:

# each .collect() returns a one-row list of Rows holding the sum
plays = nfl_game_play.groupBy().agg({'total_plays': 'sum'}).collect()
touchdowns = nfl_game_play.groupBy().agg({'total_touchdowns': 'sum'}).collect()

As you can see, I tried storing each sum as a variable, but I can't see how to combine them beyond noting each value and doing the division manually.
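The manual route I'm trying to avoid would look something like this (a sketch assuming the default aggregate column names Spark generates for dict-style agg, sum(total_plays) and sum(total_touchdowns)):

plays_sum = plays[0]['sum(total_plays)']                 # first (only) Row, default column name
touchdowns_sum = touchdowns[0]['sum(total_touchdowns)']
avg_plays_per_td = plays_sum / touchdowns_sum            # plain-Python division of the two sums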

Upvotes: 2

Views: 49

Answers (1)

notNull

Reputation: 31480

Try the code below:

Example:

df.show()
#+-----------+----------------+
#|total_plays|total_touchdowns|
#+-----------+----------------+
#|         90|               3|
#|         95|               0|
#|          9|            null|
#+-----------+----------------+

from pyspark.sql.functions import *

# sum() skips nulls, so rows with null total_touchdowns don't distort the ratio
total_avg=df.groupBy().agg(sum("total_plays")/sum("total_touchdowns")).collect()[0][0]
#64.66666666666667
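The same aggregation can also be written with a namespaced import, which avoids the wildcard import shadowing Python's built-in sum (a style variant, not required for correctness; the plays_per_td alias is just an illustrative name):

from pyspark.sql import functions as F

# Equivalent aggregation; df.agg() without groupBy() aggregates the whole frame,
# and .alias() gives the ratio column an explicit name.
total_avg = df.agg(
    (F.sum("total_plays") / F.sum("total_touchdowns")).alias("plays_per_td")
).collect()[0][0]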

Upvotes: 3
