Reputation: 83
I'm trying to run a for loop in PySpark that needs to filter a variable for an algorithm.
Here's an example of my dataframe df_prods:
+----------+--------------------+--------------------+
|        ID|                NAME|                TYPE|
+----------+--------------------+--------------------+
|      7983|         SNEAKERS 01|            Sneakers|
|      7034|            SHIRT 13|               Shirt|
|      3360|           SHORTS 15|               Short|
+----------+--------------------+--------------------+
I want to iterate over a list of IDs, get the match from the algorithm, and then filter on the product's type.
I created a function that gets the type:
def get_type(ID_PROD):
    return [row[0] for row in df_prods.filter(df_prods.ID == ID_PROD).select("TYPE").collect()]
And I wanted it to return:
print(get_type(7983))
Sneakers
But I find two issues:
1- It takes a long time (longer than a similar operation took in plain Python).
2- It returns a list of strings, ['Sneakers'], and when I try to filter the products with it, this happens:
type = get_type(7983)
df_prods.filter(df_prods.type == type)
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [Sneakers]
Does anyone know a better way to approach this in PySpark?
Thank you very much in advance. I'm having a very hard time learning PySpark.
Upvotes: 0
Views: 683
Reputation: 1035
A small adjustment to your function: this version returns the actual string value of the target column from the first record found after filtering.
from pyspark.sql.functions import col

def get_type(ID_PROD):
    return df_prods.filter(col("ID") == ID_PROD).select("TYPE").collect()[0]["TYPE"]

type = get_type(7983)
df_prods.filter(col("TYPE") == type)  # works
I find using col("colname") to be much more readable.
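For example, with the sample data from your question, the function now returns a plain string rather than a one-element list, so the comparison is a supported literal:

print(get_type(7983))                                  # Sneakers (a str, not ['Sneakers'])
df_prods.filter(col("TYPE") == get_type(7983)).show()  # no ArrayList error this time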
About the performance issue you've mentioned, I really cannot say without more details (e.g. inspecting the data and the rest of your application). Try this syntax and tell me if the performance improves.
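That said, if the slow part is that each call to get_type runs a separate filter + collect (i.e. one Spark job per ID) inside your Python loop, one possible sketch, assuming you already have the list of IDs up front (called id_list here, which is not in your code), is to collect all the types in a single pass and then look them up in a plain dict on the driver:

from pyspark.sql.functions import col

id_list = [7983, 7034, 3360]  # hypothetical list of IDs you iterate over

# One filter + one collect for the whole list instead of one job per ID
rows = df_prods.filter(col("ID").isin(id_list)).select("ID", "TYPE").collect()
type_by_id = {row["ID"]: row["TYPE"] for row in rows}

df_prods.filter(col("TYPE") == type_by_id[7983])  # same result as get_type(7983)

Whether this actually helps depends on how the rest of your loop uses the result.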
Upvotes: 1