Bamqf

Reputation: 3542

Spark: parse JSON consisting only of arrays and integers

I have a file that contains one line

[[1],[2,3]]

I believe this is a valid JSON file, and I want to read it in Spark, so I tried:

df = spark.read.json('file:/home/spark/testSparkJson.json')
df.head()
Row(_corrupt_record=u'[[1],[2,3]]')

It seems that Spark failed to parse this file. I want Spark to read it as an array of arrays of longs in a single column, so that I get:

df.head()
Row(sequence=[[1], [2, 3]])
df.printSchema()
root
 |-- sequence: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: long (containsNull = true)

How can I do this?

I'm using pyspark in Spark 2.1.0; solutions based on other languages or previous versions are also welcome.

Upvotes: 1

Views: 292

Answers (1)

Mariusz

Reputation: 13946

Spark requires every JSON line to contain a single JSON object (dictionary), but your file contains a top-level array. If you change the file content to:

{"sequence": [[1],[2,3]]}

then Spark will create the schema you wanted:

>>> spark.read.json("/tmp/sample.json").printSchema()
root
 |-- sequence: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: long (containsNull = true)
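
If editing the file is not an option, a possible workaround (a sketch, not part of the original answer) is to read the file as plain text, parse each line with Python's json module, and wrap the parsed list in a Row under the desired column name; Spark's schema inference then produces the array-of-array-of-long type. This assumes the spark session object from the pyspark shell, as in the question:

import json
from pyspark.sql import Row

# Path taken from the question; adjust as needed.
path = 'file:/home/spark/testSparkJson.json'

# Read each line as raw text, parse it with json.loads,
# and wrap the resulting list in a Row named "sequence".
rows = spark.sparkContext.textFile(path).map(
    lambda line: Row(sequence=json.loads(line)))

# Schema inference on the Row objects yields array<array<bigint>>.
df = spark.createDataFrame(rows)
df.printSchema()
# root
#  |-- sequence: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: long (containsNull = true)
df.head()
# Row(sequence=[[1], [2, 3]])

This trades Spark's built-in JSON reader for per-line Python parsing, so it is slower on large files, but it leaves the source file untouched.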

Upvotes: 1
