Reputation: 11
When reading text files stored on HDFS, if I run into a BlockMissingException (or some other exception) while reading those files with sc.textFile, how can I catch the error and proceed with an emptyRDD?
The reason I might run into a BlockMissingException is if, for example, the files are stored on HDFS with a replication factor of 1 and a data node goes down.
Consider the following minimal example:
val myRDD: RDD[String] =
  try {
    sc.textFile("hdfs:///path/to/fileWithMissingBlock")
  } catch {
    case e: BlockMissingException =>
      println("missing block, continuing with empty RDD")
      sc.emptyRDD[String]
    case e: Throwable =>
      println("unknown exception, continuing with empty RDD")
      sc.emptyRDD[String]
  }
val nLines = myRDD.count
println("There are " + nLines + " lines")
This program fails instead of producing the desired count of 0 in the case where the file has a missing block. Here is the exception I receive:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: ...
I understand that Spark may run things out of order, so exception handling may be better situated within an RDD.map (e.g. Apache spark scala Exception handling), but what if the RDD hasn't been created yet?
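For context, the per-record pattern I mean is roughly the following sketch (the parsing step is purely illustrative, not my actual code):

import scala.util.Try

// Per-record handling inside a transformation: failures in user code
// (here a hypothetical toInt parse) are caught element by element.
val parsed: RDD[Int] = myRDD.flatMap(line => Try(line.trim.toInt).toOption)

That works once an RDD exists, but it doesn't help with failures that occur while the input itself is being read.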
Upvotes: 1
Views: 596
Reputation: 1110
When you call sc.textFile("hdfs:///path/to/fileWithMissingBlock"), Spark doesn't do anything yet (lazy evaluation), i.e. it doesn't read the file from your filesystem. The read is only executed when an action is called, in this case the count method. That is the moment the exception comes in.
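A minimal sketch of the same idea, moving the try/catch to the action instead of the sc.textFile call (the path is the one from your question, and whether returning 0 is the right recovery is up to you). Note the HDFS error arrives wrapped in a SparkException, as shown in your stack trace:

import org.apache.spark.SparkException

// Only records the lineage; nothing is read from HDFS yet.
val myRDD = sc.textFile("hdfs:///path/to/fileWithMissingBlock")

// The read, and therefore the failure, happens inside the action.
val nLines: Long =
  try {
    myRDD.count
  } catch {
    case e: SparkException =>
      // BlockMissingException surfaces here, wrapped in a SparkException.
      println("count failed, treating file as empty: " + e.getMessage)
      0L
  }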
Upvotes: 1