Reputation: 331
I have a dataframe (in PySpark) where one of the column values is a dictionary:
df.show()
And it looks like:
+----+---+-----------------------------+
|name|age|info |
+----+---+-----------------------------+
|rob |26 |{color: red, car: volkswagen}|
|evan|25 |{color: blue, car: mazda} |
+----+---+-----------------------------+
Based on the comments, to give more detail:
df.printSchema()
The types are strings
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- info: string (nullable = true)
Is it possible to take the keys from the dictionary (color and car) and make them columns in the dataframe, and have the values be the rows for those columns?
Expected Result:
+----+---+-----+----------+
|name|age|color|car       |
+----+---+-----+----------+
|rob |26 |red  |volkswagen|
|evan|25 |blue |mazda     |
+----+---+-----+----------+
I don't know if I have to use df.withColumn() and somehow iterate through the dictionary to pick each key and then make a column out of it. The answers I've found so far mostly use Pandas rather than Spark, so I'm not sure if I can apply the same logic.
Upvotes: 6
Views: 11136
Reputation: 2829
I feel the most scalable solution is the following one, which collects the keys generically and passes each of them through a lambda function to build the columns:
from pyspark.sql.functions import explode, map_keys, col
keysDF = df.select(explode(map_keys(df.info))).distinct()
keysList = keysDF.rdd.map(lambda x: x[0]).collect()
keyCols = list(map(lambda x: col("info").getItem(x).alias(str(x)), keysList))
df.select(df.name, df.age, *keyCols).show()
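Note that map_keys only works on an actual MapType column. With the string schema shown in the question, the info column would first need to be parsed into a map. A minimal sketch of such a pre-processing step, assuming the values always look like "{key: value, key: value}", using Spark's built-in str_to_map SQL function:
from pyspark.sql.functions import expr

# Assumption: every info value looks like "{key: value, key: value}".
# Strip the braces and parse the rest into a real MapType column so that
# map_keys()/getItem() in the snippet above work as expected.
df = df.withColumn(
    "info",
    expr("str_to_map(regexp_replace(info, '[{}]', ''), ', ', ': ')")
)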
Upvotes: 0
Reputation: 630
As you can see with the printSchema function, your dictionary is understood by Spark as a string. The function that slices a string and creates new columns is split(), so a simple solution to this problem could be:
1. Create a UDF that strips the dictionary string down to its values.
2. Apply a split and create two new columns from that cleaned string.
The code:
from pyspark.sql.functions import udf, split

@udf()
def transform_dict(dict_str):
    str_of_dict_values = dict_str.\
        replace("}", "").\
        replace("{", "").\
        replace("color:", "").\
        replace(" car: ", "").\
        strip()
    # output example: 'red,volkswagen'
    return str_of_dict_values

# Create a new column with our UDF, holding the dict values as a comma-separated str
df = df.withColumn('info_clean', transform_dict("info"))
# Split these values and store the result in a tmp variable
split_col = split(df['info_clean'], ',')
# Create new columns with the split values
df = df.withColumn('color', split_col.getItem(0))
df = df.withColumn('car', split_col.getItem(1))
This solution is only correct if we assume that the dictionary elements always come in the same order and that the keys are fixed. For more complex cases we could build an actual dictionary inside the UDF and form the string of values by explicitly looking up each of the dictionary keys, which guarantees that the order of the output values is maintained (see the sketch below).
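A hedged sketch of that more robust variant, assuming the pairs are always formatted as "key: value" separated by commas (the helper name dict_values_ordered is mine, not from the answer):
from pyspark.sql.functions import udf

@udf()
def dict_values_ordered(dict_str):
    # Parse "{color: red, car: volkswagen}" into a real Python dict
    pairs = dict_str.strip("{}").split(",")
    parsed = {k.strip(): v.strip() for k, v in (p.split(":", 1) for p in pairs)}
    # Look up the keys explicitly so the output order never depends on the input order
    return ",".join([parsed.get("color", ""), parsed.get("car", "")])

df = df.withColumn('info_clean', dict_values_ordered("info"))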
Upvotes: 0
Reputation: 43504
Your strings:
"{color: red, car: volkswagen}"
"{color: blue, car: mazda}"
are not in a python friendly format. They can't be parsed using json.loads, nor can they be evaluated using ast.literal_eval.
However, if you know the keys ahead of time and can assume that the strings are always in this format, you should be able to use pyspark.sql.functions.regexp_extract:
For example:
from pyspark.sql.functions import regexp_extract
df.withColumn("color", regexp_extract("info", "(?<=color: )\w+(?=(,|}))", 0))\
.withColumn("car", regexp_extract("info", "(?<=car: )\w+(?=(,|}))", 0))\
.show(truncate=False)
#+----+---+-----------------------------+-----+----------+
#|name|age|info |color|car |
#+----+---+-----------------------------+-----+----------+
#|rob |26 |{color: red, car: volkswagen}|red |volkswagen|
#|evan|25 |{color: blue, car: mazda} |blue |mazda |
#+----+---+-----------------------------+-----+----------+
The pattern is:
(?<=color: ): A positive look-behind for the literal string "color: "
\w+: One or more word characters
(?=(,|})): A positive look-ahead for either a literal comma or a closing curly brace.
Here is how to generalize this for more than two keys, and handle the case where the key does not exist in the string.
from pyspark.sql.functions import regexp_extract, when, col
from functools import reduce
keys = ["color", "car", "year"]
pat = r"(?<=%s: )\w+(?=(,|}))"
df = reduce(
lambda df, c: df.withColumn(
c,
when(
col("info").rlike(pat%c),
regexp_extract("info", pat%c, 0)
)
),
keys,
df
)
df.drop("info").show(truncate=False)
#+----+---+-----+----------+----+
#|name|age|color|car |year|
#+----+---+-----+----------+----+
#|rob |26 |red |volkswagen|null|
#|evan|25 |blue |mazda |null|
#+----+---+-----+----------+----+
In this case, we use pyspark.sql.functions.when and pyspark.sql.Column.rlike to test whether the string contains the pattern before we try to extract the match.
If you don't know the keys ahead of time, you'll either have to write your own parser or try to modify the data upstream.
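For completeness, a rough sketch of the "write your own parser" route, assuming the pairs are always formatted as "key: value" separated by commas (the helper name parse_info is mine): parse the string into a MapType column with a UDF, after which individual keys can be pulled out with getItem, or collected generically as in the map_keys answer above.
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

@udf(MapType(StringType(), StringType()))
def parse_info(dict_str):
    # Turn "{color: red, car: volkswagen}" into {"color": "red", "car": "volkswagen"}
    pairs = dict_str.strip("{}").split(",")
    return {k.strip(): v.strip() for k, v in (p.split(":", 1) for p in pairs)}

df = df.withColumn("info_map", parse_info("info"))
df.select("name", "age", df["info_map"].getItem("color").alias("color")).show()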
Upvotes: 2