user

Reputation: 771

How to use .contains() in PySpark to filter by single or multiple substrings?

This is a simple question (I think) but I'm not sure the best way to answer it.

I need to filter based on presence of "substrings" in a column containing strings in a Spark Dataframe.

Currently I am doing the following (filtering using .contains):

import pandas as pd
from pyspark.sql import functions as spark_fns

data = [[1, "ABC"], [2, "DEF"], [3, "DEF"], [4, "JFKLMN"], [5, "JFKOPQ"], [6, "MNO"]]
df = pd.DataFrame(data, columns=["Item", "String"])
sdf = spark.createDataFrame(df)

filtered_sdf = sdf.filter(spark_fns.col("String").contains("JFK"))

but I want to generalize this so I can filter on one or more substrings, like below:

filtered_sdf = sdf.filter(
    spark_fns.col("String").contains("JFK") | spark_fns.col("String").contains("ABC")
) 

or

filtered_sdf = sdf.filter(
    spark_fns.col("String").contains("ABC")
)

where, ideally, the substrings passed to .contains() would come from a pre-set parameter holding one or more substrings.

Does anyone know what the best way to do this would be? Or an alternative method?

I've tried using .isin(substring_list), but it doesn't work because .isin() checks for exact matches rather than the presence of substrings.
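
For example (using a placeholder list of substrings), something like this only returns rows whose value is exactly one of the list entries, not rows that merely contain them:

substring_list = ["ABC", "JFK"]

# .isin() tests for exact equality against the list, so rows like
# "JFKLMN" and "JFKOPQ" are not returned.
sdf.filter(spark_fns.col("String").isin(substring_list))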

Upvotes: 6

Views: 17722

Answers (2)

jbmgo

Reputation: 81

You could create a regex pattern that fits all your desired patterns:

list_desired_patterns = ["ABC", "JFK"]
regex_pattern = "|".join(list_desired_patterns)

Then apply the rlike Column method:

filtered_sdf = sdf.filter(
    spark_fns.col("String").rlike(regex_pattern)
)

This keeps every row whose String value matches at least one of the desired patterns.
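
Putting it together, here is a minimal end-to-end sketch (assuming spark_fns is pyspark.sql.functions and sdf is built from the question's example data; re.escape guards against substrings that contain regex metacharacters):

import re

from pyspark.sql import SparkSession
from pyspark.sql import functions as spark_fns

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1, "ABC"), (2, "DEF"), (3, "DEF"), (4, "JFKLMN"), (5, "JFKOPQ"), (6, "MNO")],
    ["Item", "String"],
)

# Escape each substring so characters like "." or "+" are treated literally,
# then join with "|" (regex alternation) to match any of them.
list_desired_patterns = ["ABC", "JFK"]
regex_pattern = "|".join(re.escape(p) for p in list_desired_patterns)

filtered_sdf = sdf.filter(spark_fns.col("String").rlike(regex_pattern))
filtered_sdf.show()
# +----+------+
# |Item|String|
# +----+------+
# |   1|   ABC|
# |   4|JFKLMN|
# |   5|JFKOPQ|
# +----+------+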

Upvotes: 8

user

Reputation: 771

Using regex matching solves the problem!

filtered_sdf = sdf.filter(
    spark_fns.col("String").contains("JFK") | spark_fns.col("String").rlike("ABC")
) 

Upvotes: 0
