Reputation: 107
I have a dataset with a column(string) containing several titles. If I wanted to do an operation, for example, where I create a new column (colB) doing an operation with contains or like (eg like("%hello%")
), so that if it had discounts in the string it would fill with a discount in colB or of have hello in the string would fill with hello in colB. And if I had both, I would put them both separated by a comma.
Example of the desired table below
colA | colB |
---|---|
Hello World | hello |
Hi World | null |
Discounts for apples | discount |
Check this discount! | discount |
Hello World and discount! | discount, hello |
How do I go around and create this kind of dataset? Many thanks in advance!
Upvotes: 0
Views: 148
Reputation: 6082
You can use concat_ws
to add your text together
from pyspark.sql import functions as F
(df
.withColumn('b', F.concat_ws(', ', F.array(
F.when(F.lower('a').like('%hello%'), F.lit('hello')),
F.when(F.lower('a').like('%discount%'), F.lit('discount')),
)))
.show(10, False)
)
+-------------------------+---------------+
|a |b |
+-------------------------+---------------+
|Hello World |hello |
|Hi World | |
|Discounts for apples |discount |
|Check this discount! |discount |
|Hello World and discount!|hello, discount|
+-------------------------+---------------+
Upvotes: 1