gashu

Reputation: 873

Pyspark dataframe split json column values into top-level multiple columns

I have a JSON column which can contain any number of key:value pairs. I want to create new top-level columns for these key:value pairs. For example, if I have this data

A                                       B
"{\"C\":\"c\" , \"D\":\"d\"...}"        b

This is the output that I want:

B   C   D  ...
b   c   d
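
For reference, the sample data above could be built up roughly like this (assuming a running SparkSession named `spark`; the column names A and B and the values are just the ones from the tables above):

    data = [('{"C": "c", "D": "d"}', 'b')]      # column A holds the raw JSON string, B is an ordinary value
    df = spark.createDataFrame(data, ("A", "B"))
    df.printSchema()                            # both A and B are inferred as strings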

There are a few similar questions about splitting a column into multiple columns, but none of them work in this case. Can anyone please help? Thanks in advance!

Upvotes: 2

Views: 3619

Answers (1)

Garren S

Reputation: 5792

You are looking for org.apache.spark.sql.functions.from_json: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@from_json(e:org.apache.spark.sql.Column,schema:String,options:java.util.Map[String,String]):org.apache.spark.sql.Column

Here's the Python code commit related to SPARK-17699: https://github.com/apache/spark/commit/fe33121a53384811a8e094ab6c05dc85b7c7ca87

Sample Usage from commit:

    >>> from pyspark.sql.functions import from_json
    >>> from pyspark.sql.types import *
    >>> data = [(1, '''{"a": 1}''')]
    >>> schema = StructType([StructField("a", IntegerType())])
    >>> df = spark.createDataFrame(data, ("key", "value"))
    >>> df.select(from_json(df.value, schema).alias("json")).collect()
    [Row(json=Row(a=1))]
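
To get from that nested `json` struct to the flat B/C/D layout in the question, one option is to parse the column with `from_json` and then expand the resulting struct with a `"json.*"` star select. A minimal sketch, assuming Spark 2.1+, a SparkSession named `spark`, and a hand-written schema that simply matches the example data:

    >>> from pyspark.sql.functions import from_json
    >>> from pyspark.sql.types import StructType, StructField, StringType
    >>> data = [('{"C": "c", "D": "d"}', 'b')]
    >>> df = spark.createDataFrame(data, ("A", "B"))
    >>> schema = StructType([StructField("C", StringType()), StructField("D", StringType())])
    >>> # parse the JSON string into a struct, then promote its fields to top-level columns
    >>> df.select("B", from_json(df.A, schema).alias("json")).select("B", "json.*").collect()
    [Row(B='b', C='c', D='d')]

One caveat: `from_json` needs the schema up front, so if the JSON keys really are arbitrary, you have to build the schema first (for example by inferring it from a sample of the data) before doing the select.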

Upvotes: 2
