Reputation: 159
I have a single-line nested JSON in the below format:
{"2018-10-09": {"CAD": 1.4861, "JPY": 129.45, "NZD": 1.7784, "USD": 1.1435}, "2018-11-30": {"CAD": 1.5116, "JPY": 128.99, "NZD": 1.6556, "USD": 1.1359}}
This is just a part of the JSON; the actual JSON is really big.
I need to read this JSON using Spark and do some preprocessing.
Here is the code:
df = spark.read.format("json") \
    .option("inferSchema", "true") \
    .option("header", "false") \
    .option("sep", ",") \
    .load(file_location)
display(df)
The schema looks like below
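(roughly — each date key is inferred as its own struct column of doubles, so printSchema() gives something like this:)
root
 |-- 2018-10-09: struct (nullable = true)
 |    |-- CAD: double (nullable = true)
 |    |-- JPY: double (nullable = true)
 |    |-- NZD: double (nullable = true)
 |    |-- USD: double (nullable = true)
 |-- 2018-11-30: struct (nullable = true)
 |    |-- ...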
I want the data to look something like this:
date        CAD   JPY     NZD   USD
2018-01-02  1.51  135.35  1.69  1.2
2018-01-03  1.50  134.97  1.69  1.209
So my new schema should contain 5 fields: date, CAD, JPY, NZD, and USD.
I tried to achieve it like below:
df1 = df.select("2018-12-27.CAD", "2018-12-27.JPY", "2018-12-27.NZD", "2018-12-27.USD")
But this involves selecting each date value individually, and there are hundreds of date values. This is painful and time-consuming.
What is an efficient way of doing this?
Upvotes: 0
Views: 184
Reputation: 10382
The below code is in Scala, but you can convert it to Python (a rough PySpark sketch is included at the end of this answer).
scala> df.show(false)
+--------------------------------+--------------------------------+
|2018-10-09 |2018-11-30 |
+--------------------------------+--------------------------------+
|[1.4861, 129.45, 1.7784, 1.1435]|[1.5116, 128.99, 1.6556, 1.1359]|
+--------------------------------+--------------------------------+
scala> val columns = df.columns

scala> val colExp =
         array(
           columns
             .map(c =>
               List(
                 lit(c).as("date"),
                 col(s"${c}.*")
               )
             )
             .map(l => struct(l: _*)): _*
         )
scala> df.select(explode(colExp).as("data")).select("data.*").show(false)
+----------+------+------+------+------+
|date |CAD |JPY |NZD |USD |
+----------+------+------+------+------+
|2018-10-09|1.4861|129.45|1.7784|1.1435|
|2018-11-30|1.5116|128.99|1.6556|1.1359|
+----------+------+------+------+------+
Explanation
val colExp =
  array(
    columns
      .map(c => List(lit(c).as("date"), col(s"${c}.*"))) // For each date column, this creates lit("2018-10-09").as("date") and col("2018-10-09.*")
      .map(l => struct(l: _*)): _*                       // This wraps each of those pairs inside a struct(...)
  )
The above code will give you the output below.
array(
  struct(
    lit("2018-10-09").as("date"),
    col("2018-10-09.*")
  ),
  struct(
    lit("2018-11-30").as("date"),
    col("2018-11-30.*")
  )
)
Once you have the above output, it is easy: you just have to explode the array and extract the struct columns.
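For reference, here is a rough PySpark equivalent of the same idea (untested sketch; it assumes df is the DataFrame you loaded in the question):
from pyspark.sql.functions import array, col, explode, lit, struct

# Build one struct per date column: the date as a literal plus all of
# that column's nested fields (CAD, JPY, NZD, USD).
col_exp = array(*[
    struct(lit(c).alias("date"), col(f"{c}.*"))
    for c in df.columns
])

# Explode the array into one row per date, then flatten the struct.
df1 = df.select(explode(col_exp).alias("data")).select("data.*")
df1.show(truncate=False)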
Upvotes: 1