Layale

Reputation: 163

Creating a column based on a list or dictionary

I am new to PySpark. I have a dataframe containing a customer id and a text field, with an associated value:

+------+-----+------+
|id    |text |value |
+------+-----+------+
|   1  |  Cat|     5|
|   1  |  Dog|     4|
|   2  |  Oil|     1|

I would like to parse the text column based on a list of keywords, create a column that tells me whether a keyword is in the text field, and extract the associated value. The expected result is this:

List_keywords = ["Dog", "Cat"]

Out

+------+-----+------+--------+---------+--------+---------+
|id    |text |value |bool_Dog|value_Dog|bool_Cat|value_Cat|
+------+-----+------+--------+---------+--------+---------+
|   1  |  Cat|     5|       0|        0|       1|        5|
|   1  |  Dog|     4|       1|        4|       0|        0|
|   2  |  Oil|     1|       0|        0|       0|        0|
+------+-----+------+--------+---------+--------+---------+

What is the best way to do this? I was thinking of creating a list or dictionary containing my keywords and looping over it with a for loop, but I'm sure there is a better way.
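
For reference, the loop-based version I had in mind looks something like this (a rough sketch; the bool_/value_ column names are just my own choice):

import pyspark.sql.functions as F

keywords = ['Dog', 'Cat']
for kw in keywords:
    df = (
        df
        # 1 if the keyword matches the text field, 0 otherwise
        .withColumn(f'bool_{kw}', (F.col('text') == kw).cast('int'))
        # the associated value when it matches, 0 otherwise
        .withColumn(f'value_{kw}', F.when(F.col('text') == kw, F.col('value')).otherwise(0))
    )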

Upvotes: 0

Views: 164

Answers (1)

Grzegorz

Reputation: 1353

Please see the solution below, which pivots on the text column with two aggregations:

import pyspark.sql.functions as F

data = [[1, 'Cat', 5], [1, 'Dog', 4], [2, 'Oil', 1]]
df = spark.createDataFrame(data, ['id', 'text', 'value'])
df.show()

+---+----+-----+
| id|text|value|
+---+----+-----+
|  1| Cat|    5|
|  1| Dog|    4|
|  2| Oil|    1|
+---+----+-----+
keywords = ['Dog', 'Cat']

(
    df
    # keep the original columns as grouping keys
    .groupby('id', 'text', 'value')
    # one pair of columns per keyword; passing the list skips the distinct-values scan
    .pivot('text', keywords)
    .agg(
        F.count('value').alias('bool'),  # 1 when text equals the keyword, null otherwise
        F.max('value').alias('value')    # the matching value, null otherwise
    )
    .fillna(0)                           # turn the nulls into 0s
    .sort('text')
).show()

+---+----+-----+--------+---------+--------+---------+
| id|text|value|Dog_bool|Dog_value|Cat_bool|Cat_value|
+---+----+-----+--------+---------+--------+---------+
|  1| Cat|    5|       0|        0|       1|        5|
|  1| Dog|    4|       1|        4|       0|        0|
|  2| Oil|    1|       0|        0|       0|        0|
+---+----+-----+--------+---------+--------+---------+
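
A note on the column names: pivot produces them as <keyword>_<alias> (Dog_bool, Dog_value, ...), not the bool_Dog / value_Dog order from the question. If you need the exact names, a rename loop works; a minimal sketch, assuming the pivoted frame above is assigned to a variable result instead of calling .show() directly:

for kw in keywords:
    result = (
        result
        .withColumnRenamed(f'{kw}_bool', f'bool_{kw}')
        .withColumnRenamed(f'{kw}_value', f'value_{kw}')
    )

Passing the keywords list to pivot('text', keywords) is also worth keeping: without it, Spark has to run an extra job to collect the distinct values of text before pivoting.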

Upvotes: 1
