Reputation: 700
I am very new to Pyspark. I tried parsing the JSON file using the following code
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:///home/malwarehunter/Downloads/122116-path.json")
df.printSchema()
The output is as follows.
root |-- _corrupt_record: string (nullable = true)
df.show()
The output looks like this
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "time1":"2...|
| "time2":"201...|
| "step":0.5,|
| "xyz":[|
| {|
| "student":"00010...|
| "attr...|
| [ -2.52, ...|
| [ -2.3, -...|
| [ -1.97, ...|
| [ -1.27, ...|
| [ -1.03, ...|
| [ -0.8, -...|
| [ -0.13, ...|
| [ 0.09, -...|
| [ 0.54, -...|
| [ 1.1, -...|
| [ 1.34, 0...|
| [ 1.64, 0...|
+--------------------+
only showing top 20 rows
The Json File looks like this.
{
"time1":"2016-12-16T00:00:00.000",
"time2":"2016-12-16T23:59:59.000",
"step":0.5,
"xyz":[
{
"student":"0001025D0007F5DB",
"attr":[
[ -2.52, -1.17 ],
[ -2.3, -1.15 ],
[ -1.97, -1.19 ],
[ 10.16, 4.08 ],
[ 10.23, 4.87 ],
[ 9.96, 5.09 ] ]
},
{
"student":"0001025D0007F5DC",
"attr":[
[ -2.58, -0.99 ],
[ 10.12, 3.89 ],
[ 10.27, 4.59 ],
[ 10.05, 5.02 ] ]
}
]}
Could you help me on parsing this and creating a Data Frame like this.
Upvotes: 5
Views: 21221
Reputation: 376
I experienced a similar issue. When Spark is reading the Json file, it expects each line to be a separate JSON object. So it will fail if you will try to load a pretty formatted JSON file. My walk around it was to minify the JSON file that Spark was reading.
Upvotes: 1
Reputation: 330093
Spark >= 2.2:
You can use multiLine
argument for JSON reader:
spark.read.json(path_to_input, multiLine=True)
Spark < 2.2
There is almost universal, but rather expensive solution, which can be used to read multiline JSON files:
SparkContex.wholeTextFiles
.DataFrameReader.json
.As long as there are no other problems with your data it should do the trick:
spark.read.json(sc.wholeTextFiles(path_to_input).values())
Upvotes: 17