scetoaux

Reputation: 398

How does one make a Hadoop task attempt fail after too many data fetch failures?

I have a Hadoop reduce task attempt that will never fail or complete unless I fail/kill it manually.

The problem surfaces when the task tracker node (due to network issues that I am still investigating) loses connectivity with other task trackers/data nodes, but not with the job tracker.

Basically, the reduce task is not able to fetch the necessary data from other data nodes due to timeouts, so it blacklists them. So far, so good: the blacklisting is expected and needed. The problem is that it will keep retrying the same blacklisted hosts for hours (following what seems to be an exponential back-off algorithm) until I manually kill it. The latest long-running task spent more than 9 hours retrying.

I see hundreds of messages like these in the log:

2013-09-09 22:34:47,251 WARN org.apache.hadoop.mapred.ReduceTask (MapOutputCopier attempt_201309091958_0004_r_000044_0.1): attempt_201309091958_0004_r_000044_0 copy failed: attempt_201309091958_0004_m_001100_0 from X.X.X.X
2013-09-09 22:34:47,252 WARN org.apache.hadoop.mapred.ReduceTask (MapOutputCopier attempt_201309091958_0004_r_000044_0.1): java.net.SocketTimeoutException: connect timed out

Is there any way or setting to specify that after n retries or seconds the task should fail and be restarted on its own on another task tracker host?

These are some of the relevant reduce/timeout Hadoop cluster parameters I have set in my cluster:

<property><name>mapreduce.reduce.shuffle.connect.timeout</name><value>180000</value></property>
<property><name>mapreduce.reduce.shuffle.read.timeout</name><value>180000</value></property>
<property><name>mapreduce.reduce.shuffle.maxfetchfailures</name><value>10</value></property>

<property><name>mapred.task.timeout</name><value>600000</value></property>
<property><name>mapred.jobtracker.blacklist.fault-timeout-window</name><value>180</value></property>
<property><name>mapred.healthChecker.script.timeout</name><value>600000</value></property>

BTW, this job runs on an AWS EMR cluster (Hadoop version: 0.20.205).

Thanks in advance.

Upvotes: 2

Views: 4471

Answers (2)

Judge Mental

Reputation: 5239

"Too many fetch failures" is actually quite common once you go past Hadoop 0.20 (which you have done). The issue seems to be related to a problem in the version of Jetty 6 that's bundled with later distributions of Hadoop. See MAPREDUCE-2386, MAPREDUCE-2529, MAPREDUCE-3851, MARREDUCE-3184.

Two things which seem to have helped me stop seeing this failure mode:

  1. Look for the patched version of Jetty 6 from Todd Lipcon at Cloudera and use a bootstrap action to replace the default binaries from AWS with the patched ones.
  2. Increase somaxconn from its default of 128 to something like 16384 with a bootstrap action, and also set ipc.server.listen.queue.size to the same value using the Configure Hadoop bootstrap action (sketched below).
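
To illustrate the second step, here is a minimal sketch of the Hadoop-side setting in the same property-file format used in the question; the 16384 value simply mirrors the listen backlog suggested above, and the OS-level net.core.somaxconn still has to be raised separately (e.g. via a sysctl call in the bootstrap action):

<!-- assumption: 16384 matches the somaxconn value suggested above -->
<property><name>ipc.server.listen.queue.size</name><value>16384</value></property>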

I believe that AMIs in the 2.3.x range use Jetty 7, so if you're inclined to upgrade to a later version of Hadoop (1.0.3), that should also help.

Upvotes: 1

Chris White

Reputation: 30089

While I'm not certain, what you're interested in is implemented in the org.apache.hadoop.mapred.ReduceTask.ReduceCopier class; specifically, look at the source of that class's constructor:

// fetch failures after which the reduce attempt aborts itself
this.abortFailureLimit = Math.max(30, numMaps / 10);

// fetch failures per map output before they are reported to the JobTracker
this.maxFetchFailuresBeforeReporting = conf.getInt(
      "mapreduce.reduce.shuffle.maxfetchfailures", REPORT_FAILURE_LIMIT);

// distinct map outputs that may fail before the shuffle gives up
this.maxFailedUniqueFetches = Math.min(numMaps, 
                                       this.maxFailedUniqueFetches);

You'll notice this is one of the configuration values you have already listed - mapreduce.reduce.shuffle.maxfetchfailures. Have you tried setting it to a smaller value (1 or 0)? Does that produce the desired behaviour?

You can also lower the connection timeout with mapreduce.reduce.shuffle.connect.timeout (again, you've already got this in your question). Try lowering the value so a connection timeout is thrown sooner (180000 is 3 minutes; try 30000 instead).
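
For example, in the same property-file format as in your question, something like the following; the exact values are just a starting point to experiment with, not tested recommendations:

<!-- illustrative values only: report after the first fetch failure, time out connects after 30s -->
<property><name>mapreduce.reduce.shuffle.maxfetchfailures</name><value>1</value></property>
<property><name>mapreduce.reduce.shuffle.connect.timeout</name><value>30000</value></property>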

Sorry this isn't definitive, but it's a place to start at least.

Upvotes: 1
