Natalia M
Natalia M

Reputation: 83

Function to filter values in PySpark

I'm trying to run a for loop in PySpark that needs a to filter a variable for an algorithm.

Here's an example of my dataframe df_prods:

+----------+--------------------+--------------------+
|ID        |        NAME        |           TYPE     |
+----------+--------------------+--------------------+
|    7983  |SNEAKERS 01         |            Sneakers|
|    7034  |SHIRT 13            |               Shirt|
|    3360  |SHORTS 15           |               Short|

I want to iterate over a list of ID's, get the match from the algorithm and then filter the product's type.

I created a function that gets the type:

def get_type(ID_PROD):
    return [row[0] for row in df_prods.filter(df_prods.ID == ID_PROD).select("TYPE").collect()]

And wanted it to return:

print(get_type(7983))
Sneakers

But I find two issues:
1- it takes a long time to do that (longer than I got doing a similar thing on Python)
2- It returns an string array type: ['Sneakers'] and when I try to filter the products, this happens:

type = get_type(7983)
df_prods.filter(df_prods.type == type)
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [Sneakers]

Does anyone know a better way to approach this on PySpark?

Thank you very much in advance. I'm having a very hard time learning PySpark.

Upvotes: 0

Views: 683

Answers (1)

kfkhalili
kfkhalili

Reputation: 1035

A little adjustment on your function. This returns the actual string of the target column from the first record found after filtering.

from pyspark.sql.functions import col

def get_type(ID_PROD):
  return df.filter(col("ID") == ID_PROD).select("TYPE").collect()[0]["TYPE"]

type = get_type(7983)
df_prods.filter(col("TYPE") == type) # works

I find using col("colname") to be much more readable.

About the performance issue you've mentioned, I really cannot say without more details (e.g. inspecting the data and the rest of your application). Try this syntax and tell me if the performance improves.

Upvotes: 1

Related Questions