Reputation: 1256
I have a dataframe that has multiple Activity values for a given Event, like so:
+-------+--------+
| Event|Activity|
+-------+--------+
|2646922| 15400|
|2646922| 15399|
|2646922| 90664|
|2646922| 11512|
+-------+--------+
And each Activity has multiple RESULT values, like so:
+-------+--------+------+
| Event|Activity|RESULT|
+-------+--------+------+
|2646922| 15400| 399.9|
|2646922| 15400| 400.1|
|2646922| 15400| 400|
|2646922| 15400|400.03|
|2646922| 15400|399.93|
|2646922| 15400| 400.1|
|2646922| 15400|399.99|
|2646922| 15400| 400.1|
|2646922| 15400| 400|
|2646922| 15400| 400.1|
|2646922| 15400| 400.1|
|2646922| 15400| 400.1|
|2646922| 15400|399.91|
|2646922| 15400| 400|
|2646922| 15400|400.13|
|2646922| 15400| 400.2|
|2646922| 15400|399.92|
|2646922| 15400| 400|
|2646922| 15400|400.28|
|2646922| 15400| 399.3|
+-------+--------+------+
I want to group by Event and cycle the aggregation through Activity, leaving one activity out each time. For example, I want to find the mean of RESULT over Activities 15399, 90664, and 11512 in Event=2646922, i.e. excluding Activity=15400. Then find the mean of RESULT over Activities 15400, 90664, and 11512, excluding Activity=15399, and so on.
Typically in Python we would use a for loop for each Event. Is there an efficient way to do this in PySpark? Would I have to use a dictionary or tuples?
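For concreteness, here is a rough sketch of the loop I have in mind (assuming the big dataframe is df; collecting the activity list to the driver is exactly what I'd like to avoid):
from pyspark.sql import functions as F

# Collect the distinct activities, then exclude each one in turn
activities = [row["Activity"] for row in df.select("Activity").distinct().collect()]
for act in activities:
    df.where(F.col("Activity") != act) \
      .groupBy("Event") \
      .agg(F.mean("RESULT")) \
      .show()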
Would appreciate any insight. Thank you.
Upvotes: 0
Views: 1423
Reputation: 15258
Assuming df is your dataframe:
I want to find the mean of RESULT in Activities 15399, 90664, 11512 in the Event=2646922 without Activity=15400
from pyspark.sql import functions as F

# Keep only the three activities of interest, then average RESULT per Event
df.where(
    "Event = 2646922 and Activity in (15399, 90664, 11512)"
).groupBy("Event").agg(F.mean("RESULT"))
find the mean of RESULT in Activities 15400, 90664, 11512 without Activity=15399
# Same idea, this time excluding Activity 15399
df.where(
    "Event = 2646922 and Activity in (15400, 90664, 11512)"
).groupBy("Event").agg(F.mean("RESULT"))
If you want to exclude each activity number one after the other, you need a self-join per event, i.e. a cartesian product restricted by the condition left.Event = right.Event and left.Activity != right.Activity. That way, for each activity number you have all the RESULT values of the other activities, and you can perform a simple group by.
Using your sample dataframes, the first one is df1 and the second one is df2. Since, as you said, it is really one big dataframe, you should apply distinct to create df1 and then apply my code.
df1.join(
    df2, on="Event"
).where(
    df1.Activity != df2.Activity
).groupBy("Event", df1.Activity).agg(F.avg("RESULT"))
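For reference, here is a minimal end-to-end sketch of that idea, assuming the single big dataframe is called df; the aliases a/b and the output column name are just for illustration. Aliasing both sides avoids the column ambiguity a self-join on a derived dataframe can otherwise raise:
from pyspark.sql import functions as F

# One row per (Event, Activity) pair on the left; all RESULT rows on the right
pairs = df.select("Event", "Activity").distinct().alias("a")
vals = df.alias("b")

# For each (Event, Activity), average the RESULTs of all *other* activities
leave_one_out = (
    pairs.join(
        vals,
        on=(F.col("a.Event") == F.col("b.Event"))
           & (F.col("a.Activity") != F.col("b.Activity")),
    )
    .groupBy(F.col("a.Event"), F.col("a.Activity"))
    .agg(F.avg("b.RESULT").alias("mean_without_activity"))
)
leave_one_out.show()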
Upvotes: 1