Filter rows in Spark dataframe from the words in RDD

Question

I have the following commands in spark,

data = sqlContext.sql("select column1, column2, column3 from table_name")

words = sc.textFile("words.txt")

words.txt has a bunch of words and data has three string columns taken from table_name.

Now I want to filter out rows in data (spark dataframe) whenever the word pattern of each word from words.txt occurs in any of the three columns of data.

For example if words.txt has word such as gon and if any of the three columns of data contains values as bygone, gone etc, I want to filter out that row.

I've tried the following:

data.filter(~data['column1'].like('%gon%') | data['column2'].like('%gon%') | data['column3'].like('%gon%')).toPandas()

This works for one word. But I want to check all the words from the words.txt and remove it. Is there a way to do this?

I am new to PySpark. Any suggestions would be helpful.

Filter rows in Spark dataframe from the words in RDD

Answers (1)

Related Questions