Reputation: 9794
There are many situations where I'm writing a data processing program and new bugs are uncovered only on larger datasets. For example, consider a script that crashes on 1 out of 100 million records (due to unexpected input or whatever); if I'm developing it on a small sample of the data, I won't see that bug. All I can do is stare at Hadoop's error logs, tweak the script, then re-run the entire job. This is horribly inefficient in both compute and developer time.
What I'd like is a way to download the segment of data the script was processing when it crashed.
Is there an easy way to get this out of Hadoop? (And ideally, Hadoop Streaming?)
Several years ago I learned a horrible trick that involved digging through the temp directories Hadoop itself makes... that doesn't seem like a good solution, though, and I was hoping there's something better by now.
Upvotes: 1
Views: 1057
Reputation: 33495
What I'd like is a way to download the segment of data the script was processing when it crashed.
"keep.failed.task.files" description is "Should the files for failed tasks be kept. This should only be used on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed."
It defaults to false. Set this property to true and the data will be kept when a task fails. The files can then be copied to a developer machine and the program debugged locally, e.g. in Eclipse.
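With the Java API, that could look like the sketch below (a minimal example assuming a Hadoop 2.x-style job setup; the class and job names are just placeholders). With Hadoop Streaming, the same property can be passed on the command line via the generic -D option, i.e. -D keep.failed.task.files=true.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DebugJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Keep the working files of failed tasks on the node's local disk.
        // Turn this off again afterwards: the storage is never reclaimed.
        conf.setBoolean("keep.failed.task.files", true);

        Job job = Job.getInstance(conf, "debug-run");
        // ... the usual mapper/reducer/input/output configuration ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```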
All I can do is stare at Hadoop's error logs, tweak the script, then re-run the entire job. This is horribly inefficient in both compute and developer time.
Also, when a Hadoop job encounters a bad record and the task crashes, the record can be skipped and the map/reduce task rerun; there is no need to run the complete job again. Check the Hadoop documentation on skipping bad records for more details.
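In the old "mapred" API, skip mode is controlled through the SkipBadRecords helper class; here is a rough sketch (the skip threshold and attempt count are illustrative, not recommendations):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipModeSetup {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Enable skip mode: tolerate up to 1 bad record around each failing
        // map call. After repeated failures the framework narrows down the
        // input range until the offending record is isolated and skipped.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1L);
        // Skip mode needs several task attempts to do its bisection,
        // so allow more retries than the default.
        conf.setMaxMapAttempts(8);
        // ... rest of the job setup ...
    }
}
```

If I recall correctly, skipped records are also saved to HDFS in sequence-file format under the job output's _logs/skip subdirectory, which is essentially the "segment of data that crashed" the question asks for.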
Upvotes: 2
Reputation: 16392
I suggest putting a try-catch block around the logic in your setup(), map(), reduce() and cleanup() methods. In the catch block for Exception, increment a counter whose group is "Exception" (or whatever) and whose name is the String returned by the exception's getMessage() method. That will let you see at a glance at least what happened. In that catch block you can also write additional information to a file, including the stack trace and the key and value (or Iterable of values) passed in.
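A minimal sketch of that pattern with the new-API Mapper (the class name, types, and stderr logging are illustrative; substitute your own processing and file-writing logic):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SafeMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // ... the actual record-processing logic goes here ...
        } catch (Exception e) {
            // One counter per distinct exception message, grouped under
            // "Exception": visible at a glance in the job UI and counters.
            String name = e.getMessage() != null
                    ? e.getMessage() : e.getClass().getSimpleName();
            context.getCounter("Exception", name).increment(1);
            // Stderr ends up in the task logs; write the offending input
            // there (or to a side file) so it can be replayed locally.
            System.err.println("Bad record at offset " + key + ": " + value);
        }
    }
}
```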
For debugging, I also like to run the Hadoop flow with "Debug As... -> Java Application" in Eclipse. That has helped me find and fix a bunch of problems in my code.
Upvotes: 1