Layale
Layale

Reputation: 163

Pyspark create multiple columns under condition with string matching from list

I have a dataframe like this

+---+-------------------------+-------+
id  |     sentence            | value |
+---+-------------------------+-------+
  1 |dogs love fruits         |     50|
  1 |Mary is in the kitchen   |     45|
  2 |John eats fruits         |     56|
+---+-------------------------+-------+

I have a list of words, and I would like to create multiple columns based on sting matching, the output would look like this

List = ["fruits", "kitchen"]
+---+-------------------------+-------+-----------+------------+------------+-------------+
id  |     sentence            | value |fruits_bool|fruits_value|kitchen_bool|kitchen_value
+---+-------------------------+-------+-----------+------------+------------+-------------+
  1 |dogs love fruits         |     50|1          |50          |0           |0 
  1 |Mary is in the kitchen   |     45|0          |0           |1           |45
  2 |John eats fruits         |     56|1          |56          |0           |0
+---+-------------------------+-------+-----------+------------+------------+-------------+

I can easily create these columns word by word, but I would like to do automatically based on the list. Here's the code I tried

groupby("id"). \
    pivot('sentence', List). \
    agg(
       F.count('sentence').alias('bool')
       F.max('value').alias('value')
     ). \
    fillna(0)

I get an empty dataframe. What is the best way to do it ?

Upvotes: 1

Views: 211

Answers (2)

Siddhant Tandon
Siddhant Tandon

Reputation: 701

You can use F.contains and cast it to integer

df\
.withColumn('fruits_bool',F.col('sentence').contains(F.lit('fruits')).cast('integer'))\
.withColumn('fruits_value',F.col('fruits_bool')*F.col('value'))\
.withColumn('kitchen_bool',F.col('sentence').contains(F.lit('kitchen')).cast('integer'))\
.withColumn('kitchen_value',F.col('kitchen_bool')*F.col('value')).show()

output :

+---+--------------------+-----+-----------+------------+------------+-------------+
| id|            sentence|value|fruits_bool|fruits_value|kitchen_bool|kitchen_value|
+---+--------------------+-----+-----------+------------+------------+-------------+
|  1|    dogs love fruits|   50|          1|          50|           0|            0|
|  2|Mary is in the ki...|   45|          0|           0|           1|           45|
|  3|    John eats fruits|   56|          1|          56|           0|            0|
+---+--------------------+-----+-----------+------------+------------+-------------+

Upvotes: 1

Steven
Steven

Reputation: 15318

Something like this probably :

from pyspark.sql import functions as F

df.select(
    "*",
    *(
        F.when(F.lit(val).isin(F.col("sentence")), 1).oterhwise(0).alias(f"{val}_bool")
        for val in List
    ),
    *(
        F.when(F.lit(val).isin(F.col("sentence")), F.col("value"))
        .oterhwise(0)
        .alias(f"{val}_value")
        for val in List
    ),
)

Upvotes: 0

Related Questions