Reputation: 398
I have a Hadoop reduce task attempt that will never fail or complete unless I fail/kill it manually.
The problem surfaces when the task tracker node (due to network issues that I am still investigating) loses connectivity with other task trackers/data nodes, but not with the job tracker.
Basically, the reduce task is not able to fetch the necessary data from other data nodes because of timeouts, so it blacklists them. So far, so good: the blacklisting is expected and needed. The problem is that it keeps retrying the same blacklisted hosts for hours (honoring what seems to be an exponential back-off algorithm) until I kill it manually. The latest long-running task had been retrying for more than 9 hours.
I see hundreds of messages like these in the log:
2013-09-09 22:34:47,251 WARN org.apache.hadoop.mapred.ReduceTask (MapOutputCopier attempt_201309091958_0004_r_000044_0.1): attempt_201309091958_0004_r_000044_0 copy failed: attempt_201309091958_0004_m_001100_0 from X.X.X.X
2013-09-09 22:34:47,252 WARN org.apache.hadoop.mapred.ReduceTask (MapOutputCopier attempt_201309091958_0004_r_000044_0.1): java.net.SocketTimeoutException: connect timed out
Is there any way or setting to specify that after n retries or seconds the task should fail and get restarted on its own on another task tracker host?
These are some of the relevant reduce/timeout parameters I have set in my cluster:
<property><name>mapreduce.reduce.shuffle.connect.timeout</name><value>180000</value></property>
<property><name>mapreduce.reduce.shuffle.read.timeout</name><value>180000</value></property>
<property><name>mapreduce.reduce.shuffle.maxfetchfailures</name><value>10</value></property>
<property><name>mapred.task.timeout</name><value>600000</value></property>
<property><name>mapred.jobtracker.blacklist.fault-timeout-window</name><value>180</value></property>
<property><name>mapred.healthChecker.script.timeout</name><value>600000</value></property>
BTW, this job runs on an AWS EMR cluster (Hadoop version: 0.20.205).
Thanks in advance.
Upvotes: 2
Views: 4471
Reputation: 5239
"Too many fetch failures" is actually quite common once you go past Hadoop 0.20 (which you have done). The issue seems to be related to a problem in the version of Jetty 6 that's bundled with later distributions of Hadoop. See MAPREDUCE-2386, MAPREDUCE-2529, MAPREDUCE-3851, MARREDUCE-3184.
Two things which seem to have helped me stop seeing this failure mode:
I believe that AMIs in the 2.3.x range use Jetty 7, so if you're inclined to upgrade to a later version of Hadoop (1.0.3), that should also help.
Upvotes: 1
Reputation: 30089
While I'm not certain, what you're interested in is implemented in the org.apache.hadoop.mapred.ReduceTask.ReduceCopier class; specifically, look at the source of that class's constructor:
// fetch failures per map output after which the reduce attempt aborts itself
this.abortFailureLimit = Math.max(30, numMaps / 10);
// failures tolerated per map output before they are reported to the JobTracker
this.maxFetchFailuresBeforeReporting = conf.getInt(
    "mapreduce.reduce.shuffle.maxfetchfailures", REPORT_FAILURE_LIMIT);
// distinct map outputs that may fail before the shuffle gives up
this.maxFailedUniqueFetches = Math.min(numMaps,
    this.maxFailedUniqueFetches);
You'll notice that mapreduce.reduce.shuffle.maxfetchfailures is one of the configuration values you have already listed. Have you tried setting it to a smaller value (1 or 0)? Does that produce the desired behavior?
You can also lower the connection timeout with mapreduce.reduce.shuffle.connect.timeout (again, you've already got this in your question). Try lowering the value so that a connection timeout is thrown sooner (180000 is 3 minutes; try 30000 instead).
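For reference, a minimal sketch of what that tuning might look like in mapred-site.xml, assuming the same property names already shown in the question (the values here are illustrative and not something I have verified on 0.20.205/EMR):
<property><name>mapreduce.reduce.shuffle.maxfetchfailures</name><value>1</value></property>
<property><name>mapreduce.reduce.shuffle.connect.timeout</name><value>30000</value></property>
With settings like these, the copier should report a fetch failure after a single failed attempt and give up on an unreachable host after 30 seconds instead of 3 minutes, which should let the framework fail the reduce attempt and reschedule it on another task tracker sooner.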
Sorry this isn't definitive, but a place to start at least.
Upvotes: 1