Reputation:
I have a result DF that looks like this:
+--------------------+----------+
| name|prediction|
+--------------------+----------+
| "Mazda RX4"| 0|
| "Mazda RX4 Wag"| 0|
| "Datsun 710"| 1|
| "Hornet 4 Drive"| 0|
| "Hornet Sportabout"| 2|
| "Valiant"| 0|
| "Duster 360"| 2|
| "Merc 240D"| 1|
| "Merc 230"| 1|
| "Merc 280"| 0|
| "Merc 450SE"| 3|
| "Merc 450SL"| 3|
+--------------------+----------+
I want to get a list of lists where each inner list contains the names that share the same prediction. So one list would be:
["Mazda RX4", "Mazda RX4 Wag", "Hornet 4 Drive", "Valiant", "Merc 280"]
I've tried result.groupBy('prediction').collect(), but that didn't work, and I can't iterate over a DataFrame with a plain loop either. Please help.
Upvotes: 2
Views: 86
Reputation: 31490
Try filter first, then groupBy + agg with collect_list:
from pyspark.sql.functions import col, collect_list

# keep only the prediction-0 rows, then collect their names into one list
df.\
    filter(col("prediction") == 0).\
    groupBy("prediction").\
    agg(collect_list(col("name"))).\
    collect()[0][1]
#result
#["Mazda RX4", "Mazda RX4 Wag", "Hornet 4 Drive", "Valiant", "Merc 280"]
Upvotes: 1