Reputation: 11
When reading text files stored on HDFS, if I run into a BlockMissingException (or some other exception) while reading those files with sc.textFile, how can I catch the error and proceed with an emptyRDD?
The reason I might run into a BlockMissingException is if, for example, the files are stored on HDFS with a replication factor of 1 and a data node goes down.
Consider the following minimal example:
val myRDD: RDD[String] =
  try {
    sc.textFile("hdfs:///path/to/fileWithMissingBlock")
  } catch {
    case e: BlockMissingException =>
      println("missing block, continuing with empty RDD")
      sc.emptyRDD[String]
    case e: Throwable =>
      println("unknown exception, continuing with empty RDD")
      sc.emptyRDD[String]
  }
val nLines = myRDD.count
println("There are " + nLines + " lines")
This program fails instead of producing the desired count of 0 in the case where the file has a missing block. Here is the exception I receive:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: ...
I understand that Spark may run things out of order, so exception handling may be better situated within an RDD.map (e.g. Apache spark scala Exception handling), but what if the RDD hasn't been created yet?
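For context, the per-record pattern I mean is roughly the following sketch (the parsing step is purely illustrative, not my actual code):

import scala.util.Try

// Per-record handling inside a transformation: failures in user code
// (here a hypothetical toInt parse) are caught element by element.
val parsed: RDD[Int] = myRDD.flatMap(line => Try(line.trim.toInt).toOption)

That works once an RDD exists, but it doesn't help with failures that occur while the input itself is being read.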
Upvotes: 1
Views: 596
Reputation: 1110
When you call sc.textFile("hdfs:///path/to/fileWithMissingBlock"), Spark doesn't do anything yet (lazy evaluation), i.e. it doesn't read the file from your filesystem. The read is only executed when an action is called, in this case the count method. That is the moment the exception comes in.
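A minimal sketch of the same idea, moving the try/catch to the action instead of the sc.textFile call (the path is the one from your question, and whether returning 0 is the right recovery is up to you). Note the HDFS error arrives wrapped in a SparkException, as shown in your stack trace:

import org.apache.spark.SparkException

// Only records the lineage; nothing is read from HDFS yet.
val myRDD = sc.textFile("hdfs:///path/to/fileWithMissingBlock")

// The read, and therefore the failure, happens inside the action.
val nLines: Long =
  try {
    myRDD.count
  } catch {
    case e: SparkException =>
      // BlockMissingException surfaces here, wrapped in a SparkException.
      println("count failed, treating file as empty: " + e.getMessage)
      0L
  }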
Upvotes: 1