Yogen Rai

Reputation: 3033

Convert nested json string in Dataset to Dataset/Dataframe in Spark Scala

I have a simple program with a Dataset whose `resource_serialized` column holds a JSON string as its value, as below:

import org.apache.spark.SparkConf

object TestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")

    val spark = org.apache.spark.sql.SparkSession
      .builder
      .config(sparkConf)
      .appName("Test")
      .getOrCreate()

    val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    df.printSchema()
    df.show()
  }
}

Schema printed is:

root
 |-- id: string (nullable = true)
 |-- resource_serialized: string (nullable = true)

Dataset printed on the console is:

+--------------------+--------------------+
|                  id| resource_serialized|
+--------------------+--------------------+
|00529e54-0f3d-4c7...|{"createdOn":"200...|
+--------------------+--------------------+

The `resource_serialized` field holds a JSON string, which (from the debug console) is:

{"createdOn":"2000-07-20 00:00:00.0","genderCode":"0"}

Now I need to create a Dataset/DataFrame out of that JSON string. How can I achieve this?

My goal is to get Dataset like this:

+--------------------+--------------------+----------+
|                  id|           createdOn|genderCode|
+--------------------+--------------------+----------+
|00529e54-0f3d-4c7...|2000-07-20 00:00    |         0|
+--------------------+--------------------+----------+

Upvotes: 1

Views: 1271

Answers (2)

QuickSilver

Reputation: 4045

The solution below maps all the key/value pairs of `resource_serialized` into a `map<string,string>` column, whose keys are then turned into individual columns that can be parsed or cast later.

import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

object TestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")

    val spark = org.apache.spark.sql.SparkSession
      .builder
      .config(sparkConf)
      .appName("Test")
      .getOrCreate()

    val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    // Parse the serialized JSON into a map<string,string> column
    val jsonColumn = from_json($"resource_serialized", MapType(StringType, StringType))
    // Collect the distinct keys to the driver so each can become a column
    val keysDF = df.select(explode(map_keys(jsonColumn))).distinct()
    val keys = keysDF.collect().map(f => f.get(0))
    val keyCols = keys.map(f => jsonColumn.getItem(f).as(f.toString))
    df.select($"id" +: keyCols: _*).show(false)

  }
}


the output would look like

+----------------------+---------------------+----------+
|id                    |createdOn            |genderCode|
+----------------------+---------------------+----------+
|00529e54-0f3d-4c76-9d3|2000-07-20 00:00:00.0|0         |
+----------------------+---------------------+----------+
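If you would rather not hard-code `MapType(StringType, StringType)`, Spark 2.4+ offers `schema_of_json`, which can infer the nested schema from a sample value. The sketch below is a self-contained variant of the same idea (it assumes all rows share the same JSON structure, and rebuilds the question's `df` so it runs on its own):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("InferSchema").getOrCreate()
import spark.implicits._

// Same input as the question; triple quotes keep the inner \" escapes literal
val json = """[{"resource_serialized":"{\"createdOn\":\"2000-07-20 00:00:00.0\",\"genderCode\":\"0\"}","id":"00529e54-0f3d-4c76-9d3"}]"""
val df = spark.read.json(Seq(json).toDS)

// Take one sample of the serialized JSON and infer its schema as a DDL string
val sample = df.select($"resource_serialized").as[String].first()
val ddl = df.select(schema_of_json(lit(sample))).as[String].first()

// Parse every row with the inferred schema and flatten the struct
val flat = df
  .select($"id", from_json($"resource_serialized", ddl, Map.empty[String, String]).as("r"))
  .select("id", "r.*")
flat.show(false)
```

This trades one extra driver-side action (fetching the sample) for not having to collect the key set, and the parsed column is a proper struct rather than a map of strings.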

Upvotes: 1

notNull

Reputation: 31540

Use the `from_json` function to convert the JSON string into DataFrame columns.

Example:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val sch = new StructType().add("createdOn", StringType).add("genderCode", StringType)

df.select(col("id"), from_json(col("resource_serialized"), sch).alias("str"))
  .select("id", "str.*")
  .show(10, false)

//result
//+----------------------+---------------------+----------+
//|id                    |createdOn            |genderCode|
//+----------------------+---------------------+----------+
//|00529e54-0f3d-4c76-9d3|2000-07-20 00:00:00.0|0         |
//+----------------------+---------------------+----------+
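If the goal is a typed `Dataset` rather than a `DataFrame`, the flattened frame above can be mapped onto a case class with `.as[...]`. A minimal self-contained sketch (the `Resource` class name and `String`-typed fields are illustrative assumptions, not from the answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Illustrative case class; field names must match the flattened column names
case class Resource(id: String, createdOn: String, genderCode: String)

val spark = SparkSession.builder.master("local[*]").appName("TypedDataset").getOrCreate()
import spark.implicits._

val json = """[{"resource_serialized":"{\"createdOn\":\"2000-07-20 00:00:00.0\",\"genderCode\":\"0\"}","id":"00529e54-0f3d-4c76-9d3"}]"""
val df = spark.read.json(Seq(json).toDS)

val sch = new StructType().add("createdOn", StringType).add("genderCode", StringType)

val ds = df
  .select(col("id"), from_json(col("resource_serialized"), sch).alias("str"))
  .select("id", "str.*")
  .as[Resource] // Dataset[Resource] instead of an untyped DataFrame
ds.show(false)
```

In a compiled application the case class should be defined at the top level (outside the method) so Spark can derive an encoder for it.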

If you have valid (non-escaped) JSON, we can read it with a schema directly in `spark.read.json`:

val json = """[{"resource_serialized":{"createdOn":"2000-07-20 00:00:00.0","genderCode":"0"},"id":"00529e54-0f3d-4c76-9d3"}]"""

val sch = new StructType()
  .add("id", StringType)
  .add("resource_serialized", new StructType()
    .add("createdOn", StringType)
    .add("genderCode", StringType))

spark.read.option("multiline", "true")
  .schema(sch)
  .json(Seq(json).toDS)
  .select("id", "resource_serialized.*")
  .show()
//+--------------------+--------------------+----------+
//|                  id|           createdOn|genderCode|
//+--------------------+--------------------+----------+
//|00529e54-0f3d-4c7...|2000-07-20 00:00:...|         0|
//+--------------------+--------------------+----------+
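Note that both approaches leave `createdOn` as a string. If a real timestamp column is wanted, it can be cast after flattening; a self-contained sketch building on the non-escaped JSON variant above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.master("local[*]").appName("CastTimestamp").getOrCreate()
import spark.implicits._

val json = """[{"resource_serialized":{"createdOn":"2000-07-20 00:00:00.0","genderCode":"0"},"id":"00529e54-0f3d-4c76-9d3"}]"""

val sch = new StructType()
  .add("id", StringType)
  .add("resource_serialized", new StructType()
    .add("createdOn", StringType)
    .add("genderCode", StringType))

val out = spark.read.option("multiline", "true").schema(sch).json(Seq(json).toDS)
  .select(col("id"),
    // "2000-07-20 00:00:00.0" parses under the default timestamp format
    col("resource_serialized.createdOn").cast(TimestampType).as("createdOn"),
    col("resource_serialized.genderCode").as("genderCode"))
out.printSchema() // createdOn is now timestamp instead of string
```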

Upvotes: 2
