user15649753

Reputation: 523

How do I search for a word in each row of my DataFrame?

I want to search for a string in my DataFrame, but I keep getting an error.

Each row of the title column in my DataFrame holds an array of strings like the following:

["CRM, Product, Merchandising, Pricing & Planning Manager -LA MER EMEA (100M$NS) - CDI", "Promotions& Fragrance Manager CLINIQUE EMEA (500M$ NS)-CDI", "Sales representative in flagship doors", "Marketing Manager Latin America/ Indies/Caribbean (V.I.E ) Guerlain - Miami (20M$ NS)", "Product Manager L’Oréal Paris, Men Expert, Roger & Gallet (trainee) DPGP TR EMEA", "Business Analyst Biotherm&Helena Rubinstein (trainee) DPL EMEA", "Product Development Manager (trainee) International Marketing", "Sales woman in flagship doors -Guerlain", "Trade Marketing Manager Latin America/ Indies/Caribbean (V.I.E ) Guerlain - Miami (18M$ NS)", "Beauty Brand Manager", "CRM, Product, Merchandising, Pricing & Planning Manager -LA MER EMEA (100M$NS) - CDI", "Promotions& Fragrance Manager CLINIQUE EMEA (500M$ NS)-CDI", "Sales representative in flagship doors", "Trade Marketing Manager Latin America/ Indies/Caribbean (V.I.E ) Guerlain - Miami (18M$ NS)", "Product Manager L’Oréal Paris, Men Expert, Roger & Gallet (trainee) DPGP TR EMEA", "Business Analyst Biotherm&Helena Rubinstein (trainee) DPL EMEA", "Product Development Manager (trainee) International Marketing", "Sales woman in flagship doors -Guerlain", "Beauty Brand Manager", "CRM, Product, Merchandising, Pricing & Planning Manager -LA MER EMEA (100M$NS) - CDI", "Promotions& Fragrance Manager CLINIQUE EMEA (500M$ NS)-CDI", "Sales representative in flagship doors", "Trade Marketing Manager Latin America/ Indies/Caribbean (V.I.E ) Guerlain - Miami (18M$ NS)", "Product Manager L’Oréal Paris, Men Expert, Roger & Gallet (trainee) DPGP TR EMEA", "Business Analyst Biotherm&Helena Rubinstein (trainee) DPL EMEA", "Product Development Manager (trainee) International Marketing", "Sales woman in flagship doors -Guerlain", "Senior Beauty Brand Manager",]

I want to keep only the rows that contain the word intern.

I used the following code:

data_select.filter(col("title").contains('intern,'))

But I get the following error:

AnalysisException: cannot resolve 'contains(title, 'intern,')' due to data type mismatch: argument 1 requires string type, however, 'title' is of array<string> type.;

Upvotes: 0

Views: 140

Answers (1)

Emma

Reputation: 9308

Since you are matching a substring against the elements of an array, one way to do it without changing the schema is to cast the array of strings to a single string and then use rlike on it.

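from pyspark.sql import functions as F
from pyspark.sql.types import StringType  # needed for the cast below

# Cast array<string> to one string, then match the substring with a regex.
df = df.filter(F.col('title').cast(StringType()).rlike('.*intern.*'))
df.show()  # Use show instead of collect: collect pulls all data onto the single driver node, which is expensive.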
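
To make the matching behaviour easier to reason about, here is a plain-Python sketch of what the cast plus rlike combination does. It is not Spark code: `spark_like_array_to_string` and `row_matches` are hypothetical helpers, the `[a, b]` rendering of the cast is an assumption about how Spark prints an array as a string, and the "Marketing intern" title is invented sample data.

```python
import re


def spark_like_array_to_string(titles):
    # Assumption: approximates Spark's array<string> -> string cast, e.g. "[a, b]".
    return "[" + ", ".join(titles) + "]"


def row_matches(titles, pattern="intern", ignore_case=False):
    """Return True if the pattern is found anywhere in the stringified array.

    rlike succeeds on a partial match, so re.search (not re.fullmatch)
    is the closest plain-Python equivalent.
    """
    flags = re.IGNORECASE if ignore_case else 0
    text = spark_like_array_to_string(titles)
    return re.search(pattern, text, flags) is not None


# Hypothetical rows; each list stands for one row's array<string> value.
rows = [
    ["Beauty Brand Manager", "Marketing intern"],
    ["Product Development Manager (trainee) International Marketing"],
    ["Sales representative in flagship doors"],
]

case_sensitive = [row_matches(r) for r in rows]
case_insensitive = [row_matches(r, ignore_case=True) for r in rows]

print(case_sensitive)    # [True, False, False]
print(case_insensitive)  # [True, True, False]
```

Two things follow from this. Because rlike accepts a partial match, the pattern 'intern' should behave the same as '.*intern.*'. The match is also case-sensitive and works on substrings, so a case-insensitive pattern such as '(?i)intern' would also pick up titles like "International Marketing"; if that is not wanted, a word-boundary pattern like '\\bintern\\b' is one option. On Spark 3.1 or later, F.exists('title', lambda t: t.contains('intern')) is another way to test each element without casting the array.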

Upvotes: 1
