Reputation: 3542
I have a file that contains a single line:
[[1],[2,3]]
I think this is a valid JSON file, and I want to read it in Spark, so I tried:
df = spark.read.json('file:/home/spark/testSparkJson.json')
df.head()
Row(_corrupt_record=u'[[1],[2,3]]')
It seems that Spark failed to parse this file. I want Spark to read it as an array of arrays of longs in a single column, so that I can have:
df.head()
Row(sequence=[[1], [2, 3]])
df.printSchema()
root
 |-- sequence: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: long (containsNull = true)
How can I do this?
I'm using pyspark with Spark 2.1.0; solutions based on other languages or previous versions are also welcome.
Upvotes: 1
Views: 292
Reputation: 13946
Spark requires every JSON line to contain a JSON object (dictionary), but your file has a top-level array. If you change the file content to:
{"sequence": [[1],[2,3]]}
then Spark will create the schema you wanted:
>>> spark.read.json("/tmp/sample.json").printSchema()
root
 |-- sequence: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: long (containsNull = true)
Upvotes: 1