FooBee

Reputation: 798

How to ignore MatchError when processing a large JSON file in Spark SQL?

I'm trying to process a bunch of large JSON log files with Spark, but it fails every time with scala.MatchError, whether I give it a schema or not.

I just want to skip the lines that don't match the schema, but I can't find how to do that in the Spark docs.

I know that writing a JSON parser and mapping it over the file RDD would get things done, but I'd rather use sqlContext.read.schema(schema).json(fileNames).selectExpr(...) because it's much easier to maintain.
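For reference, the manual fallback mentioned above (parse each line yourself and drop records that don't fit) can be sketched outside Spark like this. This is plain Python with the standard json module; the schema fields and type checks are illustrative, not from the original post:

```python
import json

# Illustrative expected shape of a log record: field name -> required type.
SCHEMA = {"timestamp": str, "level": str, "message": str}

def parse_or_skip(lines, schema=SCHEMA):
    """Yield parsed records, silently skipping lines that are not valid
    JSON objects or whose fields don't match the expected types."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # malformed JSON: skip instead of failing the whole job
        if not isinstance(record, dict):
            continue  # top-level value is not an object
        if all(isinstance(record.get(k), t) for k, t in schema.items()):
            yield record

logs = [
    '{"timestamp": "2016-01-01T00:00:00", "level": "INFO", "message": "ok"}',
    'not json at all',
    '{"timestamp": "2016-01-01T00:00:01", "level": 42, "message": "bad type"}',
]
good = list(parse_or_skip(logs))
# only the first line survives the filter
```

In Spark this per-line function would be the body of a map/flatMap over the text-file RDD, which is exactly the maintenance burden the question is trying to avoid.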

Upvotes: 1

Views: 516

Answers (1)

Arnon Rotem-Gal-Oz

Reputation: 25929

This will be solved in Spark 1.6.1; see https://issues.apache.org/jira/browse/SPARK-12057

For now you can compile a version of Spark that includes the fix. Essentially, it raises a parsing exception instead of letting a general MatchError escape, and then reports the record as corrupt - see the code at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
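The pattern the fix implements amounts to catching the parse failure per record and routing the raw line into a corrupt-record field instead of crashing. A minimal sketch of that idea (plain Python; the `_corrupt_record` name mirrors Spark's default corrupt-record column, everything else here is illustrative):

```python
import json

CORRUPT_COL = "_corrupt_record"  # Spark's default corrupt-record column name

def parse_record(line):
    """Return the parsed record, or a row whose only populated field is
    the corrupt-record column holding the raw input line."""
    try:
        record = json.loads(line)
        if isinstance(record, dict):
            return record
    except ValueError:
        pass  # fall through: treat as corrupt rather than raising
    return {CORRUPT_COL: line}

rows = [parse_record(l) for l in ['{"a": 1}', '{broken']]
# rows[0] is the parsed object; rows[1] carries the raw line as corrupt
```

After the fix, querying the corrupt-record column lets you inspect or filter out bad lines instead of the whole read failing.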

Upvotes: 0
