Reputation: 10429
Here is my JSON
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
]
Here is my parse code:
val mdf = sparkSession.read.option("multiline", "true").json("multi2.json")
mdf.show(false)
This outputs:
+---------------+---------+--------+----+-------+
|_corrupt_record|array |dict |int |string |
+---------------+---------+--------+----+-------+
|[ |null |null |null|null |
|null |[1, 2, 3]|[value1]|1 |string1|
|null |[2, 4, 6]|[value2]|2 |string2|
|] |null |null |null|null |
+---------------+---------+--------+----+-------+
Why do I have a _corrupt_record when everything looks OK? And why does the dict column only show the values and not the keys?
Thanks
Upvotes: 0
Views: 690
Reputation: 2328
The "multiLine" option is only supported from Spark 2.2.0 onwards. Contrast it with the 2.1.0 documentation. On Spark 2.2.0 or later, your example code works with that data as-is.
Regarding the dict column: show will still display only the values, but the schema (including the key names) is preserved. You can verify with:
scala> mdf.printSchema
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
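Since dict is read as a struct, its key is available as a field name rather than in the displayed cell. A minimal sketch (assuming the same mdf DataFrame from above in a spark-shell session) of pulling the nested field out explicitly:

```scala
// Select the nested struct field by its dotted path;
// the resulting column is named after the key.
mdf.select("string", "dict.key").show(false)

// Or alias it explicitly if you want a clearer column name.
mdf.selectExpr("string", "dict.key AS dict_key").show(false)
```

So the keys are not lost; they live in the schema and can be projected out whenever you need them.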
EDIT: I realized much of this info is already here
Upvotes: 1