Mohammad Rijwan

Reputation: 357

Convert nested JSON to DataFrame in Scala Spark

I want to create a DataFrame out of JSON for only a given key. Its value is a list of nested JSON objects. I tried flattening, but I think there should be a simpler workaround since I only need one key of the JSON converted into a DataFrame. I have JSON like this:

   ("""
       {
            "Id_columns": 2,
            "metadata": [{
            "id": "1234",
            "type": "file",
            "length": 395
            }, {
            "id": "1235",
            "type": "file2",
            "length": 396
            }]
    }""")

Now I want to create a DataFrame using Spark for only the key 'metadata'. I have written this code:

val json = Json.parse("""
    {
        "Id_columns": 2,
        "metadata": [{
            "id": "1234",
            "type": "file",
            "length": 395
        }, {
            "id": "1235",
            "type": "file2",
            "length": 396
        }]
    }""")

var jsonlist = Json.stringify(json("metadata"))
val rddData = spark.sparkContext.parallelize(jsonlist)
val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
resultDF.show()

But it's giving me this error:

overloaded method value json with alternatives:
cannot be applied to (org.apache.spark.rdd.RDD[Char])
[error] val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
                                                                                      ^

I am expecting this result:

+----+-----+--------+
| id | type| length |
+----+-----+--------+
|1234|file | 395    |
|1235|file2| 396    |
+----+-----+--------+

Upvotes: 1

Views: 1090

Answers (1)

baitmbarek

Reputation: 2518

You need to explode your array, like this:

import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.read.json(
  spark.sparkContext.parallelize(Seq("""{"Id_columns":2,"metadata":[{"id":"1234","type":"file","length":395},{"id":"1235","type":"file2","length":396}]}"""))
)

df.select(explode($"metadata").as("metadata"))
  .select("metadata.*")
  .show(false)

Output:

+----+------+-----+
|id  |length|type |
+----+------+-----+
|1234|395   |file |
|1235|396   |file2|
+----+------+-----+
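
As a side note, the error in your original code comes from spark.sparkContext.parallelize(jsonlist): a bare String is treated as a sequence of characters, so parallelize returns an RDD[Char], which none of the json(...) overloads accept. A minimal sketch of a fix, assuming the same json value you parsed with Json.parse in your question, is to wrap the string in a Seq so that parallelize yields an RDD[String]:

// Wrapping the JSON string in a Seq makes parallelize produce an RDD[String]
// (a bare String would be treated as a Seq[Char] and give an RDD[Char]).
val jsonlist = Json.stringify(json("metadata"))
val rddData  = spark.sparkContext.parallelize(Seq(jsonlist))
val resultDF = spark.read
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .json(rddData)

resultDF.show()

Also note that spark.read.json on an RDD[String] is deprecated since Spark 2.2; with import spark.implicits._ in scope you can pass a Dataset[String] instead, for example spark.read.json(Seq(jsonlist).toDS).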

Upvotes: 3
