user15300490

Extracting data out of pyspark DataFrame

I have a result DF that looks like this:

+--------------------+----------+
|                name|prediction|
+--------------------+----------+
|         "Mazda RX4"|         0|
|     "Mazda RX4 Wag"|         0|
|        "Datsun 710"|         1|
|    "Hornet 4 Drive"|         0|
| "Hornet Sportabout"|         2|
|           "Valiant"|         0|
|        "Duster 360"|         2|
|         "Merc 240D"|         1|
|          "Merc 230"|         1|
|          "Merc 280"|         0|
|        "Merc 450SE"|         3|
|        "Merc 450SL"|         3|
+--------------------+----------+

I want to get a list of lists, where each inner list contains the names that share the same prediction. So one list would be:

["Mazda RX4", "Mazda RX4 Wag", "Hornet 4 Drive", "Valiant", "Merc 280"]

I've tried result.groupBy('prediction').collect(), but it didn't work: groupBy returns a GroupedData object, which has no collect method. I also can't iterate over the DataFrame with a plain loop. Please help.

Upvotes: 2

Views: 86

Answers (1)

notNull

Reputation: 31490

Try filter, then groupBy + agg with collect_list:

from pyspark.sql.functions import col, collect_list

# keep one prediction group, then gather its names into a single list
df.filter(col("prediction") == 0) \
  .groupBy("prediction") \
  .agg(collect_list(col("name")).alias("names")) \
  .collect()[0]["names"]

#result
#["Mazda RX4", "Mazda RX4 Wag", "Hornet 4 Drive", "Valiant", "Merc 280"]

Upvotes: 1
