tzeyl

Reputation: 11

How can I catch a BlockMissingException when reading files using Spark sc.textFile?

When reading text files stored on HDFS, if I run into a BlockMissingException (or some other exception) while reading those files with sc.textFile, how can I catch the error and proceed with an emptyRDD?

The reason I might run into a BlockMissingException is if, for example, the files are stored on HDFS with a replication factor of 1 and a data node goes down.

Consider the following minimum example code:

    import org.apache.spark.rdd.RDD
    import org.apache.hadoop.hdfs.BlockMissingException

    val myRDD: RDD[String] =
        try {
            sc.textFile("hdfs:///path/to/fileWithMissingBlock")
        } catch {
            case e: BlockMissingException =>
                println("missing block, continuing with empty RDD")
                sc.emptyRDD[String]
            case e: Throwable =>
                println("unknown exception, continuing with empty RDD")
                sc.emptyRDD[String]
        }

    val nLines = myRDD.count
    println("There are " + nLines + " lines")

When the file has a missing block, this program fails instead of producing the desired count of 0. Here is the exception I receive:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: ...

I understand that Spark may run things out of order, so exception handling may be better situated within an RDD.map (e.g. Apache spark scala Exception handling), but what if the RDD hasn't been created yet?

Upvotes: 1

Views: 596

Answers (1)

Thang Nguyen

Reputation: 1110

When you call sc.textFile("hdfs:///path/to/fileWithMissingBlock"), Spark doesn't actually do anything yet (lazy evaluation), i.e. it does not read the file from your filesystem.

The read only happens when an action is called, which here is the count method. That is the moment the exception comes in.
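A minimal sketch of one way to deal with this (not taken from the original answer): move the try/catch around the action rather than around the textFile call. Note that on the driver the failure typically surfaces as an org.apache.spark.SparkException wrapping the BlockMissingException, so the sketch catches SparkException rather than BlockMissingException directly; the path and the fallback value of 0 are just placeholders.

    import org.apache.spark.SparkException
    import org.apache.spark.rdd.RDD

    // Building the RDD never touches HDFS; the read happens when an action runs.
    val myRDD: RDD[String] = sc.textFile("hdfs:///path/to/fileWithMissingBlock")

    // Wrap the action instead, since that is where the failure actually surfaces.
    val nLines: Long =
        try {
            myRDD.count
        } catch {
            case e: SparkException =>
                println("job failed (e.g. missing block), continuing with count 0: " + e.getMessage)
                0L
        }

    println("There are " + nLines + " lines")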

Upvotes: 1
