Reputation: 199
This is the JSON File [https://drive.google.com/file/d/1Jb3OdoffyA71vYfojxLedZNPDLq9bn7b/view?usp=sharing]
I am new to Scala. I am learning how to use Scala to parse JSON files and ingest them into Spark as a table. I know how to do this in Python, but I am having trouble doing it in Scala.
The table/DataFrame should look like this after parsing the JSON file below:
id      pub_date   doc_id    unique_id      c_id  p_id  type      source
lni001  20220301   7098727   64WP-UI-POLI   002   P02   org       internet
lni001  20220301   7098727   64WP-UI-POLI   002   P02   org       internet
lni001  20220301   7098727   64WP-UI-POLI   002   P02   org       internet
lni002  20220301   7097889   64WP-UI-CFGT   012   K21   location  internet
lni002  20220301   7097889   64WP-UI-CFGT   012   K21   location  internet
It would be great if I could get some help or ideas on how to do this. Thanks!
Here is the code I used:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

import spark.implicits._

val df = spark.read.option("multiline", true).json("json_path")
df.show()
But the code cannot parse the nested part (the content field). Here is a peek at the data:
{
  "id": "lni001",
  "pub_date": "20220301",
  "doc_id": "7098727",
  "unique_id": "64WP-UI-POLI",
  "content": [
    {
      "c_id": "002",
      "p_id": "P02",
      "type": "org",
      "source": "internet"
    },
    {
      "c_id": "002",
      "p_id": "P02",
      "type": "org",
      "source": "internet"
    },
    {
      "c_id": "002",
      "p_id": "P02",
      "type": "org",
      "source": "internet"
    }
  ]
}
Upvotes: 0
Views: 797
Reputation: 26
You should specify the schema explicitly; Spark may be unable to infer it on its own. You can try it this way:
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("id", StringType),
  StructField("pub_date", StringType),
  StructField("doc_id", StringType),
  StructField("unique_id", StringType),
  StructField("content", ArrayType(MapType(StringType, StringType)))))

spark.read
  .option("multiline", true)
  .schema(schema)
  .json("path")
  .show(false)
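Note that reading with a schema still leaves content as a single array column; to get the flattened one-row-per-element table shown in the question, you can explode the array and then select the nested fields as top-level columns. A minimal self-contained sketch (the local[*] master and the inline JSON string are just for illustration; with the inferred schema the array elements come back as structs, so dot access like $"content.c_id" works):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object FlattenJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flatten-json-example")
      .master("local[*]")  // local master only for this demo
      .getOrCreate()
    import spark.implicits._

    // One record matching the structure from the question,
    // inlined here instead of reading from a file path.
    val json =
      """{"id":"lni001","pub_date":"20220301","doc_id":"7098727",
        |"unique_id":"64WP-UI-POLI","content":[
        |{"c_id":"002","p_id":"P02","type":"org","source":"internet"},
        |{"c_id":"002","p_id":"P02","type":"org","source":"internet"},
        |{"c_id":"002","p_id":"P02","type":"org","source":"internet"}]}""".stripMargin

    val df = spark.read.json(Seq(json).toDS)

    // explode() produces one output row per element of the content array;
    // the second select pulls the nested fields up as flat columns.
    val flat = df
      .select($"id", $"pub_date", $"doc_id", $"unique_id",
        explode($"content").as("content"))
      .select($"id", $"pub_date", $"doc_id", $"unique_id",
        $"content.c_id", $"content.p_id", $"content.type", $"content.source")

    flat.show(false)
    spark.stop()
  }
}
```

The same two-step select works after reading from your file with spark.read.option("multiline", true).json(path).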
Upvotes: 1