thentangler

Reputation: 1256

Looping GroupBy in Pyspark

I have a dataframe with multiple Activity values for a given Event, like so:

+-------+--------+
|  Event|Activity|
+-------+--------+
|2646922|   15400|
|2646922|   15399|
|2646922|   90664|
|2646922|   11512|
+-------+--------+

And each Activity has multiple RESULT values like so:

+-------+--------+------+
|  Event|Activity|RESULT|
+-------+--------+------+
|2646922|   15400| 399.9|
|2646922|   15400| 400.1|
|2646922|   15400|   400|
|2646922|   15400|400.03|
|2646922|   15400|399.93|
|2646922|   15400| 400.1|
|2646922|   15400|399.99|
|2646922|   15400| 400.1|
|2646922|   15400|   400|
|2646922|   15400| 400.1|
|2646922|   15400| 400.1|
|2646922|   15400| 400.1|
|2646922|   15400|399.91|
|2646922|   15400|   400|
|2646922|   15400|400.13|
|2646922|   15400| 400.2|
|2646922|   15400|399.92|
|2646922|   15400|   400|
|2646922|   15400|400.28|
|2646922|   15400| 399.3|
+-------+--------+------+

I want to group by Event and cycle aggregations through Activity. For example, I want to find the mean of RESULT for Activities 15399, 90664, 11512 in Event=2646922, without Activity=15400. Then find the mean of RESULT for Activities 15400, 90664, 11512, without Activity=15399, and so on.

Typically in Python we would use a for loop for each Event. Is there an efficient way to do this in PySpark? Would I have to use a Dictionary or Tuples?

Would appreciate any insight. Thank You

Upvotes: 0

Views: 1423

Answers (1)

Steven

Reputation: 15258

Assuming df is your dataframe:

I want to find the mean of RESULT for Activities 15399, 90664, 11512 in Event=2646922, without Activity=15400

from pyspark.sql import functions as F

df.where(
    "Event = 2646922 and Activity in (15399, 90664, 11512)"
).groupBy("Event").agg(F.mean("RESULT"))

find the mean of RESULT for Activities 15400, 90664, 11512, without Activity=15399

df.where(
    "Event = 2646922 and Activity in (15400, 90664, 11512)"
).groupBy("Event").agg(F.mean("RESULT"))

If you want to exclude each activity number one after the other, you need to create a cartesian join with the condition left.Event = right.Event and left.Activity != right.Activity. That way, for each activity number, you have all the RESULT values of the other activities, and you can perform a simple group by.


If I use your sample dataframes, the first one is df1 and the second one df2. Since, as you said, it is all one big dataframe, you should apply distinct to create df1 and then apply my code.
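A minimal sketch of that preparation step, assuming the single big dataframe is named df and has the columns Event, Activity and RESULT:

# df2 keeps the detail rows; df1 is the distinct (Event, Activity) pairs
df2 = df.select("Event", "Activity", "RESULT")
df1 = df.select("Event", "Activity").distinct()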

# alias both sides so the two Activity columns can be told apart
df1.alias("a").join(
    df2.alias("b"), on="Event"
).where(
    F.col("a.Activity") != F.col("b.Activity")
).groupBy("Event", "a.Activity").agg(F.avg("b.RESULT"))

Upvotes: 1
