Reputation: 146
I have a PySpark dataframe with values and dictionaries that provide a textual mapping for the values. Not every row has the same dictionary and the values can vary too.
| value | dict |
| -------- | ---------------------------------------------- |
| 1 | {"1": "Text A", "2": "Text B"} |
| 2 | {"1": "Text A", "2": "Text B"} |
| 0 | {"0": "Another text A", "1": "Another text B"} |
I want to make a "status" column that contains the right mapping.
| value | dict | status |
| -------- | ------------------------------- | -------- |
| 1 | {"1": "Text A", "2": "Text B"} | Text A |
| 2 | {"1": "Text A", "2": "Text B"} | Text B |
| 0 | {"0": "Other A", "1": "Other B"} | Other A |
I have tried this code:
df.withColumn("status", F.col("dict").getItem(F.col("value"))
This code does not work. With a hard coded value, like "2", the same code does provide an output, but of course not the right one:
df.withColumn("status", F.col("dict").getItem("2"))
Could someone help me with getting the right mapped value in the status column?
EDIT: my code did work after all. The problem was that my "value" column was a double while the keys in "dict" are strings. After casting the column from double to int to string, the code works.
Upvotes: 0
Views: 3263
Reputation: 146
Here are my 2 cents
Create the dataframe by reading from CSV or any other source (in my case it is just static data)
from pyspark.sql.types import StructType, StructField, StringType, MapType

# The IDs are strings so they match the map's string keys (and the StringType
# field in the schema below).
data = [
    ("1", {"1": "Text A", "2": "Text B"}),
    ("2", {"1": "Text A", "2": "Text B"}),
    ("0", {"0": "Another text A", "1": "Another text B"}),
]
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Dictionary", MapType(StringType(), StringType()), True),
])
df = spark.createDataFrame(data, schema=schema)
df.show(truncate=False)
Then directly extract the dictionary value based on the id as a key.
df.withColumn('extract',df.Dictionary[df.ID]).show(truncate=False)
Upvotes: 1
Reputation: 54
Hope this helps.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, udf
import json

if __name__ == '__main__':
    spark = SparkSession.builder.appName('Medium').master('local[1]').getOrCreate()
    df = spark.read.format('csv').option("header", "true").option("delimiter", "|").load("/Users/dshanmugam/Desktop/ss.csv")

    # Join the value and the JSON dict with "-", then split them apart inside
    # the UDF and look the key up in the parsed dict.
    def return_value(data):
        # split('-', 1) keeps any "-" inside the JSON intact
        key, raw_json = data.split('-', 1)
        return json.loads(raw_json)[key]

    returnVal = udf(return_value)
    df_new = (
        df.withColumn("newCol", concat_ws("-", col("value"), col("dict")))
          .withColumn("result", returnVal(col("newCol")))
    )
    df_new.select(["value", "result"]).show(10, False)
Result:
+-----+--------------+
|value|result |
+-----+--------------+
|1 |Text A |
|2 |Text B |
|0 |Another text A|
+-----+--------------+
I am using a UDF. You can try other options if performance is a concern.
Upvotes: 1