Reputation: 13
I want to create a dataframe from a complex JSON in String format using Spark scala.
Spark version is 3.1.2. Scala version is 2.12.14.
The source data is like below:
{
  "info": [
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "fields": true,
        "stat": "not sure"
      }
    },
    {
      "done": "time",
      "id": 14,
      "type": "normal",
      "pid": 764310,
      "add": {
        "fields": true,
        "stat": "sure"
      }
    },
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "note": {
          "id": 922,
          "score": 0
        }
      }
    }
  ],
  "more": {
    "a": "ok",
    "b": "fine",
    "c": 3
  }
}
I have tried the following, but it is not working:
val schema = new StructType().add("info", ArrayType(StringType)).add("more", StringType)
val rdd = ss.sparkContext.parallelize(Seq(Row(data))) // data is as mentioned above JSON
val df = ss.createDataFrame(rdd, schema)
df.printSchema()
The schema gets printed as below:
root
|-- info: array (nullable = true)
| |-- element: string (containsNull = true)
|-- more: string (nullable = true)
print(df.head())
The above line throws an exception: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of array&lt;string&gt;
Please help me to do this.
Upvotes: 0
Views: 266
Reputation: 13
I found a solution that worked for me:
val schema = new StructType().add("data", StringType)
val rdd = ss.sparkContext.parallelize(Seq(Row(data)))
val df = ss.createDataFrame(rdd, schema)
df.printSchema()
println(df.head().getAs("data").toString)
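This stores the whole JSON as a single string column. To actually parse it into the nested structure, `from_json` with an explicit schema can be applied on top of that column. A minimal sketch, assuming the same `df` and SparkSession `ss` as above; the `add` field is left out here because its shape varies between array elements and would need its own schema decision:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Schema matching the nested JSON: "info" is an array of structs, "more" is a struct
val jsonSchema = new StructType()
  .add("info", ArrayType(new StructType()
    .add("done", StringType)
    .add("id", LongType)
    .add("type", StringType)
    .add("pid", LongType)))
  .add("more", new StructType()
    .add("a", StringType)
    .add("b", StringType)
    .add("c", LongType))

// Parse the string column and expand it into proper nested columns
val parsed = df.withColumn("parsed", from_json(col("data"), jsonSchema))
parsed.select(col("parsed.info"), col("parsed.more")).printSchema()
```

Fields missing from the schema are simply dropped during parsing, and malformed records yield null, so the schema can be grown incrementally.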
Upvotes: 0
Reputation: 1183
If the data resides in files in HDFS/S3 etc., you can easily read them using the spark.read.json function.
Something like this should work on HDFS:
val df = spark.read.option("multiline","true").json("hdfs:///home/vikas/sample/*.json")
On S3 it would be:
val df = spark.read.option("multiline","true").json("s3a://vikas/sample/*.json")
Please ensure that you have read access to the path to read the files.
As mentioned in your comment, you are reading data from an API. In that case, the following should work for Spark 2.2 and above:
import spark.implicits._
val jsonStr = """{ "metadata": { "key": 84896, "value": 54 }}"""
val df = spark.read.json(Seq(jsonStr).toDS)
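Applied to the JSON in the question (assuming it is held in a String variable named data, and that spark.implicits._ is in scope for toDS), the same pattern lets Spark infer the full nested schema, including the info array and the more struct, without declaring a StructType by hand:

```scala
import spark.implicits._

// Spark infers the nested schema from the sample; records in "info" whose
// "add" field differs in shape get their schemas merged automatically
val df = spark.read.json(Seq(data).toDS)
df.printSchema()
df.select("more.a", "more.c").show()
```

Schema inference scans the data, so for large or repeated loads it is usually better to capture the inferred schema once and pass it explicitly on subsequent reads.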
Upvotes: 1