Reputation: 544
I have a pyspark dataframe which looks like this-
| RowNumber | value |
| --- | --- |
| 1 | [{[email protected], Name=abc}, {[email protected], Name=mnc}] |
| 2 | [{[email protected], Name=klo}, {[email protected], Name=mmm}] |
The column "value" is of string type.
root
 |-- value: string (nullable = false)
 |-- rowNumber: integer (nullable = false)
Step 1: I need to explode the dictionaries inside the list on each row under the column "value".
Step 2: Then I need to further explode each dictionary into separate columns, one per key.
But when I try to get to Step 1 using:
df.select(explode(col('value')).alias('value'))
it shows me this error:
AnalysisException: cannot resolve 'explode("value")' due to data type mismatch: input to function explode should be array or map type, not string
How do I convert the string under column 'value' to a compatible data type so that I can proceed with exploding the dictionary elements as a valid array/json (Step 1) and then into separate columns (Step 2)? Please help.
Upvotes: 1
Views: 2653
Reputation: 1026
EDIT: There may be a simpler way to do this with the from_json function and StructTypes; for that, you can check out this link.
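For reference, here is a minimal sketch of that from_json route. It assumes the strings are valid JSON arrays of objects with mail and Name keys (the column and key names are taken from the question; adjust the schema to your actual data):
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Assumed shape: each string is a JSON array like [{"mail": "...", "Name": "abc"}, ...]
schema = ArrayType(StructType([
    StructField("mail", StringType()),
    StructField("Name", StringType()),
]))

df = df.withColumn("value", from_json(col("value"), schema))   # string -> array of structs
df = df.withColumn("value", explode(col("value")))             # one struct per row
df = df.select("rowNumber", "value.mail", "value.Name")        # struct fields -> columns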
The ast way
To parse a string as a dictionary or list in Python, you can use the ast library's literal_eval function. If we convert this function to a PySpark UDF, then the following code will suffice:
from pyspark.sql.functions import udf, col, explode, map_values
from pyspark.sql.types import ArrayType, MapType, StringType
from ast import literal_eval

# Wrap literal_eval in a UDF that returns an array of string-to-string maps
literal_eval_udf = udf(literal_eval, ArrayType(MapType(StringType(), StringType())))
table = table.withColumn("value", literal_eval_udf(col("value")))  # Make strings into ArrayTypes of MapTypes
table = table.withColumn("value", explode(col("value")))           # Explode ArrayTypes such that each row contains a MapType
After applying these functions to the table, what remains is what you originally referred to as the start of Step 2. From here, we want to turn each key in the "value" column's maps into its own column, with entries taken from the corresponding values. This is accomplished with another function application, which gives us the dict values:
table = table.withColumn("value", map_values(col("value")))
Now the value column contains an ArrayType of the values from each dictionary. To make a separate column for each of these, we simply add them in a loop:
keys = ['mail', 'Name']
for i, key in enumerate(keys):
    table = table.withColumn(key, table.value[i])  # i-th array element becomes its own column
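As a side note that is not part of the original answer: if relying on the positional order of map_values feels fragile, you can instead pull each value out of the exploded map by key, before the map_values step. A minimal sketch, using the same key names:
# Alternative sketch: index the exploded MapType column by key instead of by position.
# Run this while "value" is still a map, i.e. before replacing it with map_values.
for key in ['mail', 'Name']:
    table = table.withColumn(key, col("value")[key])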
Then you can drop the original value column, since you won't need it anymore: you'll now have the columns mail and Name with the information from the corresponding maps.
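For completeness, a short usage sketch of that last step (column names as above):
table = table.drop("value")   # the exploded map/array is no longer needed
table.show(truncate=False)    # remaining columns: rowNumber (from the original df), mail, Name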
Upvotes: 1