FooBee

Reputation: 798

How to ignore MatchError when processing a large JSON file in Spark SQL?

I'm trying to process a bunch of large JSON log files with Spark, but it fails every time with scala.MatchError, whether I give it a schema or not.

I just want to skip the lines that don't match the schema, but I can't find how to do that in the Spark docs.

I know that writing a JSON parser and mapping it over the file RDD would get things done, but I'd rather use sqlContext.read.schema(schema).json(fileNames).selectExpr(...) because it's much easier to maintain.
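For reference, the manual fallback mentioned above (parse each line yourself and drop records that don't fit) can be sketched outside Spark like this. This is plain Python with the standard json module; the schema fields and type checks are illustrative, not from the original post:

```python
import json

# Illustrative expected shape of a log record: field name -> required type.
SCHEMA = {"timestamp": str, "level": str, "message": str}

def parse_or_skip(lines, schema=SCHEMA):
    """Yield parsed records, silently skipping lines that are not valid
    JSON objects or whose fields don't match the expected types."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # malformed JSON: skip instead of failing the whole job
        if not isinstance(record, dict):
            continue  # top-level value is not an object
        if all(isinstance(record.get(k), t) for k, t in schema.items()):
            yield record

logs = [
    '{"timestamp": "2016-01-01T00:00:00", "level": "INFO", "message": "ok"}',
    'not json at all',
    '{"timestamp": "2016-01-01T00:00:01", "level": 42, "message": "bad type"}',
]
good = list(parse_or_skip(logs))
# only the first line survives the filter
```

In Spark this per-line function would be the body of a map/flatMap over the text-file RDD, which is exactly the maintenance burden the question is trying to avoid.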

Upvotes: 1

Views: 516

Answers (1)

Arnon Rotem-Gal-Oz

Reputation: 25929

This will be solved in Spark 1.6.1; see https://issues.apache.org/jira/browse/SPARK-12057

For now you can compile a version of Spark that includes the fix. Essentially, it raises a parsing exception instead of letting a general MatchError escape, and then reports the record as corrupt - see the code at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
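The pattern the fix implements amounts to catching the parse failure per record and routing the raw line into a corrupt-record field instead of crashing. A minimal sketch of that idea (plain Python; the `_corrupt_record` name mirrors Spark's default corrupt-record column, everything else here is illustrative):

```python
import json

CORRUPT_COL = "_corrupt_record"  # Spark's default corrupt-record column name

def parse_record(line):
    """Return the parsed record, or a row whose only populated field is
    the corrupt-record column holding the raw input line."""
    try:
        record = json.loads(line)
        if isinstance(record, dict):
            return record
    except ValueError:
        pass  # fall through: treat as corrupt rather than raising
    return {CORRUPT_COL: line}

rows = [parse_record(l) for l in ['{"a": 1}', '{broken']]
# rows[0] is the parsed object; rows[1] carries the raw line as corrupt
```

After the fix, querying the corrupt-record column lets you inspect or filter out bad lines instead of the whole read failing.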

Upvotes: 0
