Reputation: 523
I want to search for a string in my DataFrame, but I keep getting an error.
In each row, the title column is an array of strings like the following:
["CRM, Product, Merchandising, Pricing & Planning Manager -LA MER EMEA (100M$NS) - CDI", "Promotions& Fragrance Manager CLINIQUE EMEA (500M$ NS)-CDI", "Sales representative in flagship doors", "Marketing Manager Latin America/ Indies/Caribbean (V.I.E ) Guerlain - Miami (20M$ NS)", "Product Manager L’Oréal Paris, Men Expert, Roger & Gallet (trainee) DPGP TR EMEA", "Business Analyst Biotherm&Helena Rubinstein (trainee) DPL EMEA", "Product Development Manager (trainee) International Marketing", "Sales woman in flagship doors -Guerlain", "Trade Marketing Manager Latin America/ Indies/Caribbean (V.I.E ) Guerlain - Miami (18M$ NS)", "Beauty Brand Manager", "CRM, Product, Merchandising, Pricing & Planning Manager -LA MER EMEA (100M$NS) - CDI", "Promotions& Fragrance Manager CLINIQUE EMEA (500M$ NS)-CDI", "Sales representative in flagship doors", "Trade Marketing Manager Latin America/ Indies/Caribbean (V.I.E ) Guerlain - Miami (18M$ NS)", "Product Manager L’Oréal Paris, Men Expert, Roger & Gallet (trainee) DPGP TR EMEA", "Business Analyst Biotherm&Helena Rubinstein (trainee) DPL EMEA", "Product Development Manager (trainee) International Marketing", "Sales woman in flagship doors -Guerlain", "Beauty Brand Manager", "CRM, Product, Merchandising, Pricing & Planning Manager -LA MER EMEA (100M$NS) - CDI", "Promotions& Fragrance Manager CLINIQUE EMEA (500M$ NS)-CDI", "Sales representative in flagship doors", "Trade Marketing Manager Latin America/ Indies/Caribbean (V.I.E ) Guerlain - Miami (18M$ NS)", "Product Manager L’Oréal Paris, Men Expert, Roger & Gallet (trainee) DPGP TR EMEA", "Business Analyst Biotherm&Helena Rubinstein (trainee) DPL EMEA", "Product Development Manager (trainee) International Marketing", "Sales woman in flagship doors -Guerlain", "Senior Beauty Brand Manager",]
I want to keep all rows that contain the word intern:
I used the following code:
data_select.filter(col("title").contains('intern,'))
But I get the following error:
AnalysisException: cannot resolve 'contains(`title`, 'intern,')' due to data type mismatch: argument 1 requires string type, however, '`title`' is of array<string> type.;
Upvotes: 0
Views: 140
Reputation: 9308
Since you are matching a substring inside an array, one way to do it without changing the schema is to cast the array of strings to a string and then use rlike.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Cast array<string> to a single string so a regex can be applied to it
df = df.filter(F.col('title').cast(StringType()).rlike('.*intern.*'))
df.show()  # prefer show() over collect(): collect() pulls every row onto the driver, which is expensive
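One caveat: rlike matches anywhere in the string, so a bare 'intern' behaves like '.*intern.*'. It will also hit words such as "international" and, being case-sensitive, miss "Intern". Below is a rough sketch in plain Python (no Spark needed) comparing the loose pattern with a stricter whole-word, case-insensitive one. The sample rows and the flatten helper are made up for illustration, and flatten only approximates what Spark's array-to-string cast produces.

```python
import re

# Hypothetical sample rows, each standing in for one array<string> title value.
rows = [
    ["Head of international marketing", "Beauty Brand Manager"],
    ["Marketing intern", "Sales representative in flagship doors"],
    ["Summer Intern, CRM"],
    ["Beauty Brand Manager"],
]

def flatten(titles):
    # Approximation (assumption) of the string an array<string> cast yields.
    return "[" + ", ".join(titles) + "]"

loose = re.compile(r"intern")            # same idea as rlike('.*intern.*')
strict = re.compile(r"(?i)\bintern\b")   # whole word only, ignoring case

loose_hits = [bool(loose.search(flatten(r))) for r in rows]
strict_hits = [bool(strict.search(flatten(r))) for r in rows]

print(loose_hits)
print(strict_hits)
```

If the stricter behaviour is what you want, the same pattern should carry over to Spark as rlike(r'(?i)\bintern\b'), since Java regex supports both (?i) and \b, but try it on a small sample of your own data first.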
Upvotes: 1