How to create a column contains words from a string?

Question

I have a dataset with a column(string) containing several titles. If I wanted to do an operation, for example, where I create a new column (colB) doing an operation with contains or like (eg like("%hello%") ), so that if it had discounts in the string it would fill with a discount in colB or of have hello in the string would fill with hello in colB. And if I had both, I would put them both separated by a comma.

Example of the desired table below

colA	colB
Hello World	hello
Hi World	null
Discounts for apples	discount
Check this discount!	discount
Hello World and discount!	discount, hello

How do I go around and create this kind of dataset? Many thanks in advance!

pltc · Accepted Answer

You can use concat_ws to add your text together

from pyspark.sql import functions as F

(df
    .withColumn('b', F.concat_ws(', ', F.array(
        F.when(F.lower('a').like('%hello%'), F.lit('hello')),
        F.when(F.lower('a').like('%discount%'), F.lit('discount')),
    )))
    .show(10, False)
)

+-------------------------+---------------+
|a                        |b              |
+-------------------------+---------------+
|Hello World              |hello          |
|Hi World                 |               |
|Discounts for apples     |discount       |
|Check this discount!     |discount       |
|Hello World and discount!|hello, discount|
+-------------------------+---------------+

How to create a column contains words from a string?

Answers (1)

Related Questions