Arjun

Reputation: 13

How can I create a DataFrame from a complex JSON in string format using Spark Scala?

I want to create a DataFrame from a complex JSON in string format using Spark Scala.

Spark version is 3.1.2. Scala version is 2.12.14.

The source data is like below:

{
  "info": [
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "fields": true,
        "stat": "not sure"
      }
    },
    {
      "done": "time",
      "id": 14,
      "type": "normal",
      "pid": 764310,
      "add": {
        "fields": true,
        "stat": "sure"
      }
    },
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "note": {
          "id": 922,
          "score": 0
        }
      }
    }
  ],
  "more": {
    "a": "ok",
    "b": "fine",
    "c": 3
  }
}

I have tried the following, but it is not working:

val schema = new StructType().add("info", ArrayType(StringType)).add("more", StringType)
val rdd = ss.sparkContext.parallelize(Seq(Row(data))) // data is the JSON string shown above
val df = ss.createDataFrame(rdd, schema)
df.printSchema()

The schema gets printed as below:

root
 |-- info: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- more: string (nullable = true)

print(df.head())

The above line throws the exception java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of array<string>

Please help me to do this.

Upvotes: 0

Views: 266

Answers (2)

Arjun

Reputation: 13

I found a solution by doing this; it worked for me:

val schema = new StructType().add("data", StringType)
val rdd = ss.sparkContext.parallelize(Seq(Row(data)))
val df = ss.createDataFrame(rdd, schema)
df.printSchema()
println(df.head().getAs("data").toString)
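
This works because the schema now declares a single string field, matching the one String carried by the Row; the original attempt failed because info was declared as array<string> while the Row supplied a flat String. It does leave the whole document in one string column, though. If structured columns are needed, one option is to parse that column with from_json. Below is a sketch with a schema hand-written from the sample document; the add object gets a merged struct, since its shape differs between array elements, and missing fields simply come back as null:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Merged shape of the "add" object across all array elements.
val addSchema = new StructType()
  .add("fields", BooleanType)
  .add("stat", StringType)
  .add("note", new StructType().add("id", LongType).add("score", LongType))

val infoElem = new StructType()
  .add("done", StringType)
  .add("id", LongType)
  .add("type", StringType)
  .add("pid", LongType)
  .add("add", addSchema)

val docSchema = new StructType()
  .add("info", ArrayType(infoElem))
  .add("more", new StructType()
    .add("a", StringType)
    .add("b", StringType)
    .add("c", LongType))

// Parse the string column, then flatten the array of structs.
val parsed = df.select(from_json(col("data"), docSchema).as("doc"))
parsed.selectExpr("explode(doc.info) AS i", "doc.more.*").show(false)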

Upvotes: 0

Vikas Saxena

Reputation: 1183

If the data resides in files on HDFS/S3 etc., you can easily read them using the spark.read.json function.

Something like this should work on HDFS:

val df = spark.read.option("multiline","true").json("hdfs:///home/vikas/sample/*.json")

On S3 it would be:

val df = spark.read.option("multiline","true").json("s3a://vikas/sample/*.json")

Please ensure that you have read access to the path before reading the files.

As mentioned in your comment, you are reading data from an API; in that case, the following should work for Spark 2.2 and above:

import spark.implicits._
val jsonStr = """{ "metadata": { "key": 84896, "value": 54 }}"""
val df = spark.read.json(Seq(jsonStr).toDS)
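
Applied to the document from the question, this infers the nested schema automatically. A sketch, assuming data holds the JSON string from the question; each element of the Dataset is parsed as one JSON record, so the embedded newlines are not a problem:

import spark.implicits._

val df = spark.read.json(Seq(data).toDS)
df.printSchema() // info comes back as an array of structs, more as a struct
df.selectExpr("explode(info) AS i", "more.*")
  .select("i.id", "i.pid", "i.add.stat", "a", "b", "c")
  .show(false)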

Upvotes: 1
