iamorozov

Reputation: 811

Apache Spark can't read a Parquet folder that is being written by a streaming job

When I try to read a Parquet folder that is currently being written to by another Spark streaming job, using the option "mergeSchema": "true", I get an error:

java.io.IOException: Could not read footer for file

The code used to read the folder:

val df = spark
    .read
    .option("mergeSchema", "true")
    .parquet("path.parquet")

Without schema merging I can read the folder just fine, but is it possible to read such a folder with schema merging enabled, regardless of other jobs concurrently updating it?

Full exception:

java.io.IOException: Could not read footer for file: FileStatus{path=hdfs://path.parquet/part-00000-20199ef6-4ff8-4ee0-93cc-79d47d2da37d-c000.snappy.parquet; isDirectory=false; length=0; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:551)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:538)
    at org.apache.spark.util.ThreadUtils$$anonfun$3$$anonfun$apply$1.apply(ThreadUtils.scala:287)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: hdfs://path.parquet/part-00000-20199ef6-4ff8-4ee0-93cc-79d47d2da37d-c000.snappy.parquet is not a Parquet file (too small length: 0)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:544)
    ... 9 more

Upvotes: 7

Views: 6284

Answers (1)

moriarty007

Reputation: 2214

Run the following before creating your DataFrame:

spark.sql("set spark.sql.files.ignoreCorruptFiles=true")

i.e. enable the config spark.sql.files.ignoreCorruptFiles.

As stated in the Spark documentation, if this config is true, Spark jobs will continue to run when encountering corrupted or non-existent files, and the contents that have been read will still be returned. This config is also used by the schema-merging flow.

It is available from Spark 2.1.1 onward.
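For completeness, a minimal sketch of the same fix set via spark.conf.set instead of a SQL statement (assuming a SparkSession named spark and the same "path.parquet" folder as in the question):

// Tell Spark to skip corrupted / zero-length part files instead of failing
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Read with schema merging; in-progress, zero-length part files left by the
// streaming job are now ignored rather than raising "Could not read footer"
val df = spark
    .read
    .option("mergeSchema", "true")
    .parquet("path.parquet")

Note that any rows in files that are only partially written when the read happens may simply be missing from the result, so this trades strict failure for best-effort reads.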

Upvotes: 12
