CodingStark

Reputation: 199

Parsing Nested JSON Data using SCALA

This is the JSON File [https://drive.google.com/file/d/1Jb3OdoffyA71vYfojxLedZNPDLq9bn7b/view?usp=sharing]

I am new to Scala and am learning how to use it to parse JSON files and ingest them into Spark as a table. I know how to do this in Python, but I am having trouble doing it in Scala.

The table/dataframe should look like this after parsing the JSON file below:

  id        pub_date    doc_id    unique_id       c_id    p_id    type       source
  lni001    20220301    7098727   64WP-UI-POLI    002     P02     org        internet
  lni001    20220301    7098727   64WP-UI-POLI    002     P02     org        internet
  lni001    20220301    7098727   64WP-UI-POLI    002     P02     org        internet
  lni002    20220301    7097889   64WP-UI-CFGT    012     K21     location   internet
  lni002    20220301    7097889   64WP-UI-CFGT    012     K21     location   internet

It would be great if I could get some help or ideas on how to do this. Thanks!

Here is the code I used:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

import spark.implicits._

// multiline is needed because each JSON record spans several lines
val df = spark.read.option("multiline", true).json("json_path")
df.show()

But the code cannot parse the nested part (the content field). Here is a peek at the data:

{
   "id":"lni001",
   "pub_date":"20220301",
   "doc_id":"7098727",
   "unique_id":"64WP-UI-POLI",
   "content":[
      {
         "c_id":"002",
         "p_id":"P02",
         "type":"org",
         "source":"internet"  
      },
      {
         "c_id":"002",
         "p_id":"P02",
         "type":"org",
         "source":"internet" 
      },
      {
         "c_id":"002",
         "p_id":"P02",
         "type":"org",
         "source":"internet" 
      }
   ]
}

Upvotes: 0

Views: 797

Answers (1)

user18552101

Reputation: 26

You should specify the schema explicitly; Spark may be unable to infer it correctly on its own. You can try it this way:

import org.apache.spark.sql.types._

// content is an array of objects with uniform string values,
// so it can be modeled as an array of string-to-string maps
val schema = StructType(Array(
  StructField("id", StringType),
  StructField("pub_date", StringType),
  StructField("doc_id", StringType),
  StructField("unique_id", StringType),
  StructField("content", ArrayType(MapType(StringType, StringType)))))

spark.read
  .option("multiline", true)
  .schema(schema)
  .json("path")
  .show(false)
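
Reading with this schema still gives you content as an array column, one row per document. To get the flattened table shown in the question, you can additionally explode the array and lift the map entries into top-level columns. A minimal sketch, assuming the key names from the sample JSON:

import org.apache.spark.sql.functions.{col, explode}

// explode produces one row per element of the content array;
// getItem pulls each map entry out as its own column
val flat = spark.read
  .option("multiline", true)
  .schema(schema)
  .json("path")
  .withColumn("content", explode(col("content")))
  .select(
    col("id"),
    col("pub_date"),
    col("doc_id"),
    col("unique_id"),
    col("content").getItem("c_id").as("c_id"),
    col("content").getItem("p_id").as("p_id"),
    col("content").getItem("type").as("type"),
    col("content").getItem("source").as("source"))

flat.show(false)

Note that explode drops rows whose content array is null or empty; use explode_outer instead if you need to keep them.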

Upvotes: 1
