Reputation: 2361
I have a relatively simple program written in C++, and I have been running it as MapReduce jobs via Hadoop Streaming (my Hadoop distribution is Cloudera).
Recently, I noticed that a lot of streaming tasks keep failing and getting restarted by the task tracker, although they finish successfully in the end. I checked the user logs, and it seems some MapReduce tasks are getting zero input. Specifically, the error message looks like this:
HOST=null
USER=mapred
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |TCGA-06-0216-0000024576-0000008192 0 27743 10716|
Date: Sun Apr 29 15:55:51 EDT 2012
java.io.IOException: Broken pipe
Sometimes the error rate is pretty high (nearly 50%), which I don't think is normal. Does anyone know
a) what is going on?
b) how I can fix it?
Thanks
Upvotes: 0
Views: 574
Reputation: 91
Does your data have a lot of characters in other languages (e.g. Chinese)?
If so, check your character encoding settings in two places:
(1) the JVM for your Hadoop cluster: it is likely set to UTF-8 by default.
(2) your mapper/reducer: make sure it emits its output in UTF-8 (or whichever encoding your JVM is set to).
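For point (1), the child task JVM encoding can usually be forced on the streaming command line with something like -D mapred.child.java.opts="-Dfile.encoding=UTF-8" (the exact property name can vary between Hadoop versions, so treat this as an example rather than the definitive flag). For point (2), here is a minimal sketch, not your actual code, of a C++ streaming mapper that keeps the output in UTF-8: it reads lines from standard input and writes tab-separated key/value pairs to standard output. Because C++ iostreams copy bytes without transcoding, UTF-8 input stays UTF-8 on output as long as you don't convert to wide strings in between.

#include <iostream>
#include <string>

// Minimal identity-style Hadoop Streaming mapper:
// reads one line at a time from stdin and emits "line<TAB>1" on stdout.
// Bytes are passed through verbatim, so UTF-8 input remains UTF-8 output.
int main() {
    std::ios::sync_with_stdio(false);
    std::string line;
    while (std::getline(std::cin, line)) {
        std::cout << line << '\t' << 1 << '\n';
    }
    return 0;
}

If your real mapper builds strings from wide characters or a different locale, that is the first place I would look for bytes being re-encoded before they reach the task tracker.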
Upvotes: 1