Reputation: 771
This is a simple question (I think) but I'm not sure the best way to answer it.
I need to filter based on presence of "substrings" in a column containing strings in a Spark Dataframe.
Currently I am doing the following (filtering using .contains):
import pandas as pd
import pyspark.sql.functions as spark_fns

data = [[1, "ABC"], [2, "DEF"], [3, "DEF"], [4, "JFKLMN"], [5, "JFKOPQ"], [6, "MNO"]]
df = pd.DataFrame(data, columns=["Item", "String"])
sdf = spark.createDataFrame(df)  # assumes an active SparkSession named spark
but I want to generalize this so I can filter on one or more strings, like below:
filtered_sdf = sdf.filter(
    spark_fns.col("String").contains("JFK") | spark_fns.col("String").contains("ABC")
)
or
filtered_sdf = sdf.filter(
    spark_fns.col("String").contains("ABC")
)
where, ideally, the .contains() portion is a preset parameter holding one or more substrings.
Does anyone know what the best way to do this would be? Or an alternative method?
I've tried using .isin(substring_list), but it doesn't work because .isin() checks for exact matches rather than the presence of substrings.
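For reference, here is a quick sketch (using the sample data above) of why .isin() falls short: it compares whole values against the list, so only exact matches survive:

sdf.filter(spark_fns.col("String").isin(["JFK", "ABC"]))
# keeps only the "ABC" row; "JFKLMN" and "JFKOPQ" are dropped
# because they merely contain "JFK" rather than equal it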
Upvotes: 6
Views: 17722
Reputation: 81
You could create a regex pattern that fits all your desired patterns:
list_desired_patterns = ["ABC", "JFK"]
regex_pattern = "|".join(list_desired_patterns)
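If any of the substrings might contain regex metacharacters (e.g. "." or "+"), it is safer to escape them first; a minimal sketch using the standard library's re.escape:

import re

list_desired_patterns = ["ABC", "JFK"]
# Escape each substring so characters like "." or "+" are matched literally
regex_pattern = "|".join(re.escape(p) for p in list_desired_patterns)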
Then apply the rlike Column method:
filtered_sdf = sdf.filter(
    spark_fns.col("String").rlike(regex_pattern)
)
This keeps every row whose String matches at least one of the desired patterns.
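With the sample data from the question, this should produce something like:

filtered_sdf.show()
# +----+------+
# |Item|String|
# +----+------+
# |   1|   ABC|
# |   4|JFKLMN|
# |   5|JFKOPQ|
# +----+------+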
Upvotes: 8
Reputation: 771
Using regex matching solves the problem!
filtered_sdf = sdf.filter(
    spark_fns.col("String").rlike("JFK|ABC")
)
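For anyone who prefers to keep .contains() and avoid regex entirely, an equivalent approach is to OR the per-substring conditions together with functools.reduce (a sketch assuming a substrings list like the one above):

from functools import reduce
import operator

substrings = ["ABC", "JFK"]
# Build one boolean Column by OR-ing a .contains() test per substring
condition = reduce(operator.or_, [spark_fns.col("String").contains(s) for s in substrings])
filtered_sdf = sdf.filter(condition)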
Upvotes: 0