Gaurav Gupta

Reputation: 159

Parsing nested JSON in spark and imposing custom schema

I have a single line nested JSON in below format

{"2018-10-09": {"CAD": 1.4861, "JPY": 129.45, "NZD": 1.7784, "USD": 1.1435}, "2018-11-30": {"CAD": 1.5116, "JPY": 128.99, "NZD": 1.6556, "USD": 1.1359}}

This is just a part of the JSON; the actual JSON is much larger.

I need to read this JSON using Spark and do some preprocessing.

Here is the code:

# note: header and sep are CSV options; the JSON reader ignores them
df = spark.read.format("json") \
  .option("inferSchema", "true") \
  .option("header", "false") \
  .option("sep", ",") \
  .load(file_location)

display(df)


The inferred schema has one struct column per date, with each struct containing the CAD, JPY, NZD and USD rates as doubles.
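For the two-date sample above, df.printSchema() would report the inferred schema roughly as:

root
 |-- 2018-10-09: struct (nullable = true)
 |    |-- CAD: double (nullable = true)
 |    |-- JPY: double (nullable = true)
 |    |-- NZD: double (nullable = true)
 |    |-- USD: double (nullable = true)
 |-- 2018-11-30: struct (nullable = true)
 |    |-- CAD: double (nullable = true)
 |    |-- JPY: double (nullable = true)
 |    |-- NZD: double (nullable = true)
 |    |-- USD: double (nullable = true)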

I want the data to look something like this:

date        CAD   JPY     NZD   USD
2018-01-02  1.51  135.35  1.69  1.2
2018-01-03  1.50  134.97  1.69  1.209

So my new schema should contain five fields: date, CAD, JPY, NZD, and USD.

I tried to achieve this as below:

df1 = df.select("2018-12-27.CAD", "2018-12-27.JPY", "2018-12-27.NZD", "2018-12-27.USD")

but that requires selecting each date value explicitly, and there are hundreds of date values. This is painful and time-consuming.

What is an efficient way of doing this?

Upvotes: 0

Views: 184

Answers (1)

s.polam

Reputation: 10382

The below code is in Scala, but you should be able to convert it to Python.

scala> df.show(false)
+--------------------------------+--------------------------------+
|2018-10-09                      |2018-11-30                      |
+--------------------------------+--------------------------------+
|[1.4861, 129.45, 1.7784, 1.1435]|[1.5116, 128.99, 1.6556, 1.1359]|
+--------------------------------+--------------------------------+
scala> val columns = df.columns

scala> 
val colExp = 
array(
    columns
    .map(c => 
        List(
            lit(c).as("date"),
            col(s"${c}.*")
        )
    )
    .map(l => struct(l:_*)):_*
)
scala> df.select(explode(colExp).as("data")).select("data.*").show(false)
+----------+------+------+------+------+
|date      |CAD   |JPY   |NZD   |USD   |
+----------+------+------+------+------+
|2018-10-09|1.4861|129.45|1.7784|1.1435|
|2018-11-30|1.5116|128.99|1.6556|1.1359|
+----------+------+------+------+------+

Explanation

val colExp = 
array(
    columns
    .map(c => List(lit(c).as("date"), col(s"${c}.*"))) // for each date column: the literal `2018-10-09` aliased as `date`, plus `2018-10-09.*`
    .map(l => struct(l:_*)):_* // wrap each of the above column lists inside struct(_)
)

The above code builds the expression below.

array(
    struct(
        lit("2018-10-09").as("date"), 
        col("2018-10-09.*")
    ), 
    struct(
        lit("2018-11-30").as("date"), 
        col("2018-11-30.*")
    )
)

Once you have this expression, the rest is easy: just explode the array and extract the struct columns.
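Since the question uses PySpark, here is a sketch of the same approach in Python, assuming spark and df are the session and DataFrame from the question:

from pyspark.sql.functions import array, col, explode, lit, struct

# one struct per date column: (date literal, CAD, JPY, NZD, USD)
col_exp = array(*[
    struct(lit(c).alias("date"), col(c + ".*"))
    for c in df.columns
])

df1 = df.select(explode(col_exp).alias("data")).select("data.*")
df1.show(truncate=False)

explode turns the array of structs into one row per date, and data.* flattens each struct into the five desired columns.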

Upvotes: 1
