Flow

Reputation: 57

Reduce a JSON string column into key/value columns

I have a DataFrame with the following structure:

| a  | b   | c                                                  |
|----|-----|----------------------------------------------------|
| 01 | ABC | {"key1":"valueA","key2":"valueC"}                  |
| 02 | ABC | {"key1":"valueA","key2":"valueC"}                  |
| 11 | DEF | {"key1":"valueB","key2":"valueD", "key3":"valueE"} |
| 12 | DEF | {"key1":"valueB","key2":"valueD", "key3":"valueE"} |

I would like to turn it into something like:

| a  | b   | key  | value  |
|----|-----|------|--------|
| 01 | ABC | key1 | valueA |
| 01 | ABC | key2 | valueC |
| 02 | ABC | key1 | valueA |
| 02 | ABC | key2 | valueC |
| 11 | DEF | key1 | valueB |
| 11 | DEF | key2 | valueD |
| 11 | DEF | key3 | valueE |
| 12 | DEF | key1 | valueB |
| 12 | DEF | key2 | valueD |
| 12 | DEF | key3 | valueE |

I'd like to do this efficiently, as the dataset can be quite large.

Upvotes: 0

Views: 208

Answers (1)

notNull

Reputation: 31540

Try parsing the JSON string with the from_json function and a MapType(StringType, StringType) schema, then explode the resulting map column into key and value columns.

Example:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val df = Seq(("01", "ABC", """{"key1":"valueA","key2":"valueC"}""")).toDF("a", "b", "c")

// Parse the JSON string into a map<string,string>, then explode it
// into one (key, value) row per map entry.
val schema = MapType(StringType, StringType)
df.withColumn("d", from_json(col("c"), schema))
  .selectExpr("a", "b", "explode(d)")
  .show(10, false)
//+---+---+----+------+
//|a  |b  |key |value |
//+---+---+----+------+
//|01 |ABC|key1|valueA|
//|01 |ABC|key2|valueC|
//+---+---+----+------+
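
For reference, here is the same approach applied to all four sample rows from the question; this is just a sketch assuming the same imports as above (and spark.implicits._ in scope for .toDF). explode handles maps of different sizes per row, so the key3 entry simply becomes an extra output row:

val fullDf = Seq(
  ("01", "ABC", """{"key1":"valueA","key2":"valueC"}"""),
  ("02", "ABC", """{"key1":"valueA","key2":"valueC"}"""),
  ("11", "DEF", """{"key1":"valueB","key2":"valueD","key3":"valueE"}"""),
  ("12", "DEF", """{"key1":"valueB","key2":"valueD","key3":"valueE"}""")
).toDF("a", "b", "c")

// Each map entry becomes its own row, so rows with three keys yield three rows.
fullDf.withColumn("d", from_json(col("c"), MapType(StringType, StringType)))
  .selectExpr("a", "b", "explode(d)")
  .show(false)

Regarding the efficiency concern: from_json and explode are built-in Spark SQL functions, so no UDF is involved and the work stays inside Spark's optimized execution engine; this should scale well to large datasets.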

Upvotes: 3
