RFT

Reputation: 1071

Unusual Hadoop error - tasks get killed on their own

When I run my hadoop job I get the following error:

Request received to kill task 'attempt_201202230353_23186_r_000004_0' by user Task has been KILLED_UNCLEAN by the user

The logs appear to be clean. I run 28 reducers, and this doesn't happen for all of them. It happens for only a few, and the affected reducer then starts again. I fail to understand this. Another thing I have noticed is that for a small dataset, I rarely see this error!

Upvotes: 6

Views: 5703

Answers (2)

topstair

Reputation: 41

There are three things to try:

Setting a Counter
If Hadoop sees a counter for the job progressing, then it won't kill it (see Arockiaraj Durairaj's answer). This seems to be the most elegant option, as it can give you more insight into long-running jobs and where the hangups may be.

Longer Task Timeouts
Hadoop tasks time out after 10 minutes of not reporting status by default. Changing the timeout is somewhat brute force, but it can work. Imagine analyzing audio files that are generally 5MB each (songs), but with a few 50MB files (entire albums). HDFS stores each file in its own block(s), so with a 64MB block size a 5MB file and a 50MB file each occupy one block (see here http://blog.cloudera.com/blog/2009/02/the-small-files-problem/, and here Small files and HDFS blocks.) However, the task processing the 5MB file will finish much faster than the one processing the 50MB file. The task timeout can be increased in the job configuration (mapred.task.timeout), per the answers to this similar question: How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."
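
As a minimal sketch of what that might look like in a job's driver code: the 30-minute value and class name are just illustrations, and mapred.task.timeout is the classic property spelling (newer Hadoop releases use mapreduce.task.timeout).

```java
import org.apache.hadoop.conf.Configuration;

public class TimeoutConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Raise the per-task timeout to 30 minutes (value is in milliseconds).
        // "mapred.task.timeout" is the classic property name; newer Hadoop
        // releases spell it "mapreduce.task.timeout". A value of 0 disables
        // the timeout entirely.
        conf.setLong("mapred.task.timeout", 30L * 60L * 1000L);
        // ... build and submit the Job with this Configuration as usual ...
    }
}
```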

Increase Task Attempts
Configure Hadoop to make more than the default 4 attempts per task (see Pradeep Gollakota's answer). This is the most brute-force method of the three. Hadoop will retry the task more times, but you could be masking an underlying issue (small servers, large data blocks, etc.).
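
For illustration, a sketch of raising the attempt limits in the driver, assuming the classic Hadoop 1.x property names (newer releases use mapreduce.map.maxattempts and mapreduce.reduce.maxattempts); the value 8 is arbitrary:

```java
import org.apache.hadoop.conf.Configuration;

public class MaxAttemptsConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Allow up to 8 attempts per task instead of the default 4.
        conf.setInt("mapred.map.max.attempts", 8);
        conf.setInt("mapred.reduce.max.attempts", 8);
        // ... build and submit the Job with this Configuration as usual ...
    }
}
```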

Upvotes: 4

Arockiaraj Durairaj

Reputation: 56

Can you try using a counter (Hadoop counter) in your reduce logic? It looks like Hadoop is unable to determine whether your reduce program is running or hanging. It waits for a few minutes and kills it, even though your logic may still be executing.
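
For example, a minimal sketch of a reducer that bumps a counter as it works; the key/value types, class name, and counter names here are only placeholders for your own job:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a reducer that reports liveness via a counter, assuming
// Text keys and LongWritable values (adjust types to your job).
public class CountingReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
            // Incrementing a counter (or calling context.progress()) tells the
            // framework the task is still alive, so it is not killed for
            // failing to report status during a long-running loop.
            context.getCounter("MyJob", "ValuesProcessed").increment(1);
        }
        context.write(key, new LongWritable(sum));
    }
}
```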

Upvotes: 1
