Arjun

Reputation: 13

How can I create a DataFrame from a complex JSON in string format using Spark Scala?

I want to create a DataFrame from a complex JSON in string format using Spark Scala.

Spark version is 3.1.2. Scala version is 2.12.14.

The source data is like below:

{
  "info": [
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "fields": true,
        "stat": "not sure"
      }
    },
    {
      "done": "time",
      "id": 14,
      "type": "normal",
      "pid": 764310,
      "add": {
        "fields": true,
        "stat": "sure"
      }
    },
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "note": {
          "id": 922,
          "score": 0
        }
      }
    }
  ],
  "more": {
    "a": "ok",
    "b": "fine",
    "c": 3
  }
}

I have tried the following, but it is not working:

val schema = new StructType().add("info", ArrayType(StringType)).add("more", StringType)
val rdd = ss.sparkContext.parallelize(Seq(Row(data))) // data is the JSON string shown above
val df = ss.createDataFrame(rdd, schema)
df.printSchema()

The schema gets printed as below:

root
 |-- info: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- more: string (nullable = true)

print(df.head())

The above line throws the exception java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of array<string>

Please help me to do this.

Upvotes: 0

Views: 266

Answers (2)

Arjun

Reputation: 13

I found a solution by doing this; it worked for me:

val schema = new StructType().add("data", StringType)
val rdd = ss.sparkContext.parallelize(Seq(Row(data)))
val df = ss.createDataFrame(rdd, schema)
df.printSchema()
println(df.head().getAs("data").toString)
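
This works because the schema now declares a single string field, matching the one String carried by the Row; the original attempt failed because info was declared as array<string> while the Row supplied a flat String. It does leave the whole document in one string column, though. If structured columns are needed, one option is to parse that column with from_json. Below is a sketch with a schema hand-written from the sample document; the add object gets a merged struct, since its shape differs between array elements, and missing fields simply come back as null:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Merged shape of the "add" object across all array elements.
val addSchema = new StructType()
  .add("fields", BooleanType)
  .add("stat", StringType)
  .add("note", new StructType().add("id", LongType).add("score", LongType))

val infoElem = new StructType()
  .add("done", StringType)
  .add("id", LongType)
  .add("type", StringType)
  .add("pid", LongType)
  .add("add", addSchema)

val docSchema = new StructType()
  .add("info", ArrayType(infoElem))
  .add("more", new StructType()
    .add("a", StringType)
    .add("b", StringType)
    .add("c", LongType))

// Parse the string column, then flatten the array of structs.
val parsed = df.select(from_json(col("data"), docSchema).as("doc"))
parsed.selectExpr("explode(doc.info) AS i", "doc.more.*").show(false)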

Upvotes: 0

Vikas Saxena

Reputation: 1183

If the data resides in files on HDFS/S3 etc., you can easily read them using the spark.read.json function.

Something like this should work on HDFS:

val df = spark.read.option("multiline","true").json("hdfs:///home/vikas/sample/*.json")

On S3 it would be:

val df = spark.read.option("multiline","true").json("s3a://vikas/sample/*.json")

Please ensure that you have read access to the path before reading the files.

As mentioned in your comment, you are reading data from an API; in that case, the following should work for Spark 2.2 and above:

import spark.implicits._
val jsonStr = """{ "metadata": { "key": 84896, "value": 54 }}"""
val df = spark.read.json(Seq(jsonStr).toDS)
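
Applied to the document from the question, this infers the nested schema automatically. A sketch, assuming data holds the JSON string from the question; each element of the Dataset is parsed as one JSON record, so the embedded newlines are not a problem:

import spark.implicits._

val df = spark.read.json(Seq(data).toDS)
df.printSchema() // info comes back as an array of structs, more as a struct
df.selectExpr("explode(info) AS i", "more.*")
  .select("i.id", "i.pid", "i.add.stat", "a", "b", "c")
  .show(false)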

Upvotes: 1
