Reputation: 83
I'm trying to run a for loop in PySpark that needs to filter a variable for an algorithm.
Here's an example of my dataframe df_prods:
+----------+--------------------+--------------------+
|        ID|                NAME|                TYPE|
+----------+--------------------+--------------------+
|      7983|         SNEAKERS 01|            Sneakers|
|      7034|            SHIRT 13|               Shirt|
|      3360|           SHORTS 15|               Short|
+----------+--------------------+--------------------+
I want to iterate over a list of IDs, get the match from the algorithm, and then filter on the product's type.
I created a function that gets the type:
def get_type(ID_PROD):
    return [row[0] for row in df_prods.filter(df_prods.ID == ID_PROD).select("TYPE").collect()]
And I wanted it to return:
print(get_type(7983))
Sneakers
But I find two issues:
1- It takes a long time (longer than a similar operation took in plain Python).
2- It returns a list of strings, ['Sneakers'], and when I try to filter the products with it, this happens:
type = get_type(7983)
df_prods.filter(df_prods.type == type)
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [Sneakers]
Does anyone know a better way to approach this in PySpark?
Thank you very much in advance. I'm having a very hard time learning PySpark.
Upvotes: 0
Views: 683
Reputation: 1035
A small adjustment to your function: this version returns the actual string value of the target column from the first record found after filtering.
from pyspark.sql.functions import col

def get_type(ID_PROD):
    return df_prods.filter(col("ID") == ID_PROD).select("TYPE").collect()[0]["TYPE"]

type = get_type(7983)
df_prods.filter(col("TYPE") == type)  # works
I find using col("colname") to be much more readable.
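For example, with the sample data from your question, the function now returns a plain string rather than a one-element list, so the comparison is a supported literal:

print(get_type(7983))                                  # Sneakers (a str, not ['Sneakers'])
df_prods.filter(col("TYPE") == get_type(7983)).show()  # no ArrayList error this time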
About the performance issue you've mentioned, I really cannot say without more details (e.g. inspecting the data and the rest of your application). Try this syntax and tell me if the performance improves.
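That said, if the slow part is that each call to get_type runs a separate filter + collect (i.e. one Spark job per ID) inside your Python loop, one possible sketch, assuming you already have the list of IDs up front (called id_list here, which is not in your code), is to collect all the types in a single pass and then look them up in a plain dict on the driver:

from pyspark.sql.functions import col

id_list = [7983, 7034, 3360]  # hypothetical list of IDs you iterate over

# One filter + one collect for the whole list instead of one job per ID
rows = df_prods.filter(col("ID").isin(id_list)).select("ID", "TYPE").collect()
type_by_id = {row["ID"]: row["TYPE"] for row in rows}

df_prods.filter(col("TYPE") == type_by_id[7983])  # same result as get_type(7983)

Whether this actually helps depends on how the rest of your loop uses the result.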
Upvotes: 1