Raja Sabarish PV

Reputation: 115

Convert Nested JSON into a DataFrame using Spark/Scala

I have a nested JSON that I need to convert into a flattened DataFrame without explicitly defining or exploding any column names.

val df = sqlCtx.read.option("multiLine",true).json("test.json")

This is what my data looks like:

[
  {
    "symbol": "TEST3",
    "timestamp": "2019-05-07 16:00:00",
    "priceData": {
      "open": "1177.2600",
      "high": "1179.5500",
      "low": "1176.6700",
      "close": "1179.5500",
      "volume": "49478"
    }
  },
  {
    "symbol": "TEST4",
    "timestamp": "2019-05-07 16:00:00",
    "priceData": {
      "open": "189.5660",
      "high": "189.9100",
      "low": "189.5100",
      "close": "189.9100",
      "volume": "267986"
    }
  }
]

Upvotes: 0

Views: 653

Answers (1)

abiratsis

Reputation: 7316

Here is one way using the DataFrameFlattener class implemented by Databricks:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

implicit class DataFrameFlattener(df: DataFrame) {
  // Flatten all nested struct fields into top-level columns
  // named with their dotted path, e.g. priceData.open.
  def flattenSchema: DataFrame = {
    df.select(flatten(Nil, df.schema): _*)
  }

  protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
    case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
    case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
  }
}

df.flattenSchema.show

And the output:

+---------------+--------------+-------------+--------------+----------------+------+-------------------+
|priceData.close|priceData.high|priceData.low|priceData.open|priceData.volume|symbol|          timestamp|
+---------------+--------------+-------------+--------------+----------------+------+-------------------+
|      1179.5500|     1179.5500|    1176.6700|     1177.2600|           49478| TEST3|2019-05-07 16:00:00|
|       189.9100|      189.9100|     189.5100|      189.5660|          267986| TEST4|2019-05-07 16:00:00|
+---------------+--------------+-------------+--------------+----------------+------+-------------------+
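The recursion at the heart of `flattenSchema` can be illustrated without Spark at all. The sketch below (a hypothetical `FlattenDemo` helper, not part of Spark) applies the same dotted-path idea to a plain nested `Map`: descend into every nested map, carrying the path of keys, and emit a leaf entry keyed by the joined path.

```scala
object FlattenDemo {
  // Recursively flatten a nested Map into dotted-path keys,
  // mirroring what flatten(...) does on a StructType schema.
  def flatten(path: Seq[String], value: Any): Seq[(String, Any)] = value match {
    case m: Map[_, _] =>
      // Struct-like node: recurse into each field, extending the path.
      m.toSeq.flatMap { case (k, v) => flatten(path :+ k.toString, v) }
    case leaf =>
      // Leaf value: emit one entry named by the full dotted path.
      Seq(path.mkString(".") -> leaf)
  }

  def main(args: Array[String]): Unit = {
    val row = Map(
      "symbol" -> "TEST3",
      "priceData" -> Map("open" -> "1177.2600", "high" -> "1179.5500")
    )
    flatten(Nil, row).foreach(println)
  }
}
```

On the sample row above this yields entries like `symbol -> TEST3` and `priceData.open -> 1177.2600`, which is exactly the column-naming scheme in the table shown.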

Or you can simply select the nested fields directly:

df.select(
  "priceData.close", 
  "priceData.high", 
  "priceData.low", 
  "priceData.open", 
  "priceData.volume", 
  "symbol", 
  "timestamp").show

Output:

+---------+---------+---------+---------+------+------+-------------------+
|    close|     high|      low|     open|volume|symbol|          timestamp|
+---------+---------+---------+---------+------+------+-------------------+
|1179.5500|1179.5500|1176.6700|1177.2600| 49478| TEST3|2019-05-07 16:00:00|
| 189.9100| 189.9100| 189.5100| 189.5660|267986| TEST4|2019-05-07 16:00:00|
+---------+---------+---------+---------+------+------+-------------------+

Upvotes: 1
