Manrique

Reputation: 2221

Check for list of substrings inside string column in PySpark

To check whether a single string is contained in the rows of one column (for example, "abc" is contained in "abcdef"), the following code is useful:

df_filtered = df.filter(df.columnName.contains('abc'))

The result would include, for example, "_wordabc", "thisabce", "2abc1".

How can I check for multiple strings (for example ['ab1','cd2','ef3']) at the same time?

I'm ideally searching for something like this:

df_filtered = df.filter(df.columnName.contains(['ab1','cd2','ef3']))

The result would include, for example, "x_ab1", "_cd2_", "abef3".

Please post scalable solutions (no for loops, for example), because the aim is to check against a big list of around 1000 elements.

Upvotes: 0

Views: 1864

Answers (1)

User12345

Reputation: 5480

All you need is isin:

df_filtered = df.filter(df['columnName'].isin('word1','word2','word3'))

Edit

Since isin only matches exact values, not substrings, you need the rlike function to achieve your result:

words="(aaa|bbb|ccc)"

df.filter(df['columnName'].rlike(words))
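For a big list like the ~1000 elements mentioned in the question, the pattern can be built programmatically. A minimal sketch (the list `substrings` and the column name are assumptions; `re.escape` is used so that any regex metacharacters in the substrings are matched literally — the actual Spark call is shown in a comment since it needs an active SparkSession):

```python
import re

# Hypothetical stand-in for the ~1000-element list from the question.
substrings = ['ab1', 'cd2', 'ef3']

# Escape each term so regex metacharacters (e.g. '.', '+') inside the
# substrings are treated literally, then join with '|' (alternation).
pattern = "|".join(re.escape(s) for s in substrings)

# The same pattern string can be passed to rlike, which matches when the
# regex is found anywhere in the column value:
#   df_filtered = df.filter(df['columnName'].rlike(pattern))

# Quick local check with Python's re module, which has the same
# "found anywhere" semantics as rlike:
for value in ["x_ab1", "_cd2_", "abef3", "no_match"]:
    print(value, bool(re.search(pattern, value)))
```

This stays a single vectorized filter on the Spark side, so there is no Python-level for loop over rows regardless of how long the substring list is.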

Upvotes: 2
