ABusy_developerT

Reputation: 544

Explode a string column with dictionary structure in PySpark

I have a pyspark dataframe which looks like this-

RowNumber value
1 [{[email protected], Name=abc}, {[email protected], Name=mnc}]
2 [{[email protected], Name=klo}, {[email protected], Name=mmm}]

The column "value" is of string type.

root
 |-- value: string (nullable = false)
 |-- rowNumber: integer (nullable = false)

Step 1 I need to explode the dictionaries inside the list on each row under column "value" like this-

RowNumber value
1 {[email protected], Name=abc}
1 {[email protected], Name=mnc}
2 {[email protected], Name=klo}
2 {[email protected], Name=mmm}

Step 2 And then to further explode the column so that the resultant table looks like :

RowNumber mail Name
1 [email protected] abc
1 [email protected] mnc
2 [email protected] klo
2 [email protected] mmm

However, when I try to get to Step 1 using:

df.select(explode(col('value')).alias('value'))

it shows me error:

Analysis Exception: cannot resolve 'explode("value")' due to data type mismatch: input to function explode should be array or map type, not string

How do I convert the string in column 'value' to a compatible data type, so that I can explode the dictionary elements as a valid array/map (Step 1) and then split them into separate columns (Step 2)?

Please help.

Upvotes: 1

Views: 2653

Answers (1)

dsillman2000

Reputation: 1026

EDIT: There may be a simpler way to do this with the from_json function and StructTypes; for that, you can check out this link.

The ast way

To parse a string as a dictionary or list in Python, you can use the ast library's literal_eval function. If we convert this function to a PySpark UDF, then the following code will suffice:
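On a plain Python string, literal_eval behaves like this (hypothetical example: it only succeeds when the string is a valid Python literal, so the {Name=abc, ...} form shown in the question would need to be rewritten with quoted keys and values first):

```python
from ast import literal_eval

# A valid Python literal representing a list of dicts.
s = "[{'mail': 'a@x.com', 'Name': 'abc'}, {'mail': 'b@x.com', 'Name': 'mnc'}]"
dicts = literal_eval(s)  # safely evaluates the literal, unlike eval()
```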

from pyspark.sql.functions import udf, col, explode, map_values
from pyspark.sql.types import ArrayType, MapType, StringType
from ast import literal_eval

# literal_eval parses each string into a list of dicts; declare the UDF's
# return type as an array of string-to-string maps to match.
literal_eval_udf = udf(literal_eval, ArrayType(MapType(StringType(), StringType())))

table = table.withColumn("value", literal_eval_udf(col("value")))  # strings -> ArrayType of MapTypes
table = table.withColumn("value", explode(col("value")))           # one MapType per row

After applying these functions to the table, what should remain is what you originally referred to as the start of "step 2." From here, we want to split each "value" column key into a column with entries from the corresponding value. This is accomplished with another function application which gives us the dict values:

table = table.withColumn("value", map_values(col("value")))

Now the values column contains an ArrayType of the values contained in each dictionary. To make a separate column for each of these, we simply add them in a loop:

keys = ['mail', 'Name']
for i, k in enumerate(keys):
  table = table.withColumn(k, table.value[i])

Then you can drop the original value column, since the mail and Name columns now carry the information from the corresponding maps (e.g. table = table.drop("value")).

Upvotes: 1

Related Questions