Reputation: 163
I have a dataframe like this
+---+-------------------------+-------+
id | sentence | value |
+---+-------------------------+-------+
1 |dogs love fruits | 50|
1 |Mary is in the kitchen | 45|
2 |John eats fruits | 56|
+---+-------------------------+-------+
I have a list of words, and I would like to create multiple columns based on sting matching, the output would look like this
List = ["fruits", "kitchen"]
+---+-------------------------+-------+-----------+------------+------------+-------------+
id | sentence | value |fruits_bool|fruits_value|kitchen_bool|kitchen_value
+---+-------------------------+-------+-----------+------------+------------+-------------+
1 |dogs love fruits | 50|1 |50 |0 |0
1 |Mary is in the kitchen | 45|0 |0 |1 |45
2 |John eats fruits | 56|1 |56 |0 |0
+---+-------------------------+-------+-----------+------------+------------+-------------+
I can easily create these columns word by word, but I would like to do automatically based on the list. Here's the code I tried
groupby("id"). \
pivot('sentence', List). \
agg(
F.count('sentence').alias('bool')
F.max('value').alias('value')
). \
fillna(0)
I get an empty dataframe. What is the best way to do it ?
Upvotes: 1
Views: 211
Reputation: 701
You can use F.contains
and cast it to integer
df\
.withColumn('fruits_bool',F.col('sentence').contains(F.lit('fruits')).cast('integer'))\
.withColumn('fruits_value',F.col('fruits_bool')*F.col('value'))\
.withColumn('kitchen_bool',F.col('sentence').contains(F.lit('kitchen')).cast('integer'))\
.withColumn('kitchen_value',F.col('kitchen_bool')*F.col('value')).show()
output :
+---+--------------------+-----+-----------+------------+------------+-------------+
| id| sentence|value|fruits_bool|fruits_value|kitchen_bool|kitchen_value|
+---+--------------------+-----+-----------+------------+------------+-------------+
| 1| dogs love fruits| 50| 1| 50| 0| 0|
| 2|Mary is in the ki...| 45| 0| 0| 1| 45|
| 3| John eats fruits| 56| 1| 56| 0| 0|
+---+--------------------+-----+-----------+------------+------------+-------------+
Upvotes: 1
Reputation: 15318
Something like this probably :
from pyspark.sql import functions as F
df.select(
"*",
*(
F.when(F.lit(val).isin(F.col("sentence")), 1).oterhwise(0).alias(f"{val}_bool")
for val in List
),
*(
F.when(F.lit(val).isin(F.col("sentence")), F.col("value"))
.oterhwise(0)
.alias(f"{val}_value")
for val in List
),
)
Upvotes: 0