Reputation: 163
I am new to PySpark. I have a dataframe containing a customer id and a text, with an associated value:
+---+----+-----+
| id|text|value|
+---+----+-----+
|  1| Cat|    5|
|  1| Dog|    4|
|  2| Oil|    1|
+---+----+-----+
I would like to parse the text column based on a list of keywords, and create columns that tell me whether a keyword is in the text field and extract the associated value. The expected result is this:
List_keywords = ["Dog", "Cat"]
Out
+---+----+-----+--------+---------+--------+---------+
| id|text|value|bool_Dog|value_Dog|bool_Cat|value_Cat|
+---+----+-----+--------+---------+--------+---------+
|  1| Cat|    5|       0|        0|       1|        5|
|  1| Dog|    4|       1|        4|       0|        0|
|  2| Oil|    1|       0|        0|       0|        0|
+---+----+-----+--------+---------+--------+---------+
What is the best way to do that? I was thinking of creating a list or a dictionary containing my keywords and iterating over it with a for loop, but I'm sure there is a better way.
Upvotes: 0
Views: 164
Reputation: 1353
You can group on the existing columns and pivot on text, restricted to your keyword list:
import pyspark.sql.functions as F
data = [[1, 'Cat', 5], [1, 'Dog', 4], [2, 'Oil', 1]]
df = spark.createDataFrame(data, ['id', 'text', 'value'])
df.show()
+---+----+-----+
| id|text|value|
+---+----+-----+
| 1| Cat| 5|
| 1| Dog| 4|
| 2| Oil| 1|
+---+----+-----+
keywords = ['Dog','Cat']
(
    df
    .groupby('id', 'text', 'value')
    .pivot('text', keywords)
    .agg(
        F.count('value').alias('bool'),
        F.max('value').alias('value')
    )
    .fillna(0)
    .sort('text')
).show()
+---+----+-----+--------+---------+--------+---------+
| id|text|value|Dog_bool|Dog_value|Cat_bool|Cat_value|
+---+----+-----+--------+---------+--------+---------+
| 1| Cat| 5| 0| 0| 1| 5|
| 1| Dog| 4| 1| 4| 0| 0|
| 2| Oil| 1| 0| 0| 0| 0|
+---+----+-----+--------+---------+--------+---------+
Upvotes: 1