Shellong

Reputation: 381

Hadoop streaming job hangs at the reduce-side merge stage

I wrote a Hadoop streaming job that uses Python code to transform the data, but the job runs into an error. When the input file is larger (e.g. 70 MB), it hangs at the reduce stage. When I shrink the input file (e.g. to 700 KB), it runs successfully. Below are some logs:

Reducer container logs:
2024-08-23 10:08:26,080 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : finalMerge called with 11 in-memory map-outputs and 0 on-disk map-outputs
2024-08-23 10:08:26,086 INFO [main] org.apache.hadoop.mapred.Merger : Merging 11 sorted segments
2024-08-23 10:08:26,087 INFO [main] org.apache.hadoop.mapred.Merger : Down to the last merge-pass, with 10 segments left of total size : 79232287 bytes
2024-08-23 10:08:26,466 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merged 11 segments, 79233409 bytes to disk to satisfy reduce memory limit
2024-08-23 10:08:26,469 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merging 1 files, 13421702 bytes from disk
2024-08-23 10:08:26,472 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merging 0 segments, 0 bytes from memory into reduce
2024-08-23 10:08:26,472 INFO [main] org.apache.hadoop.mapred.Merger : Merging 1 sorted segments
2024-08-23 10:08:26,480 INFO [main] org.apache.hadoop.mapred.Merger : Down to the last merge-pass, with 1 segments left of total size : 79233279 bytes

Application master logs:
24/08/05 10:08:51 INFO mapreduce.Job: map 100% reduce 100%
24/08/05 10:29:19 INFO mapreduce.Job: Task Id attempt_XXXXXX, Status: FAILED
AttemptID : attempt_XXXXXX Timed out after 1200 secs

I also checked the counters:

Map input records: 703,640 (this is correct)
Map output records: 685,583
Reduce input records: 685,583 (not correct)
Custom counter from code - ReduceInputRecords: 685,489 (this is counted in my code)

This shows that on the reduce side the true number of records received is 685,489, not the 685,583 counted by Hadoop. The code appears to hang on the line that reads sys.stdin. Does anyone know why?
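For reference, here is a minimal sketch of the kind of streaming reducer I mean (the transform logic is omitted; the counter group/name and the tab-separated input format are illustrative, only the sys.stdin loop and the stderr counter protocol matter):

    #!/usr/bin/env python
    # Minimal streaming reducer sketch: read key\tvalue lines from stdin
    # and report a custom counter via Hadoop streaming's stderr protocol.
    import sys

    def main():
        for line in sys.stdin:
            # "reporter:counter:<group>,<name>,<amount>" on stderr
            # increments a custom counter in the Hadoop UI.
            sys.stderr.write("reporter:counter:Code,ReduceInputRecords,1\n")
            key, _, value = line.rstrip("\n").partition("\t")
            # ... per-record processing would happen here ...
            sys.stdout.write(key + "\t" + value + "\n")

    if __name__ == "__main__":
        main()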

Upvotes: 1

Views: 27

Answers (1)

Shellong

Reputation: 381

I found the key reason: the reduce task is complicated and simply spends a very long time executing.

I added some status and counter reporting to my reducer task, and that revealed the cause: a for loop in the reducer has N*N complexity, so the reducer spends so long inside it without reporting progress that the attempt hits the 1200-second timeout.
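Here is a sketch of the kind of status reporting I added (the loop body is a placeholder for the real N*N work; the "reporter:status:" lines on stderr are standard Hadoop streaming and count as task progress):

    #!/usr/bin/env python
    # Sketch: report status from inside a long-running reducer loop so
    # the framework sees progress and does not kill the attempt.
    import sys

    records = [line.rstrip("\n") for line in sys.stdin]

    for i, rec in enumerate(records):
        for other in records:   # the expensive O(N*N) part (placeholder)
            pass
        if i % 1000 == 0:
            # Any "reporter:" line on stderr counts as progress, so the
            # attempt is not killed by mapreduce.task.timeout.
            sys.stderr.write("reporter:status:processed %d/%d records\n"
                             % (i, len(records)))

Raising mapreduce.task.timeout would also avoid the kill, but periodic status reports keep the default timeout and make the reducer's progress visible in the UI.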

Upvotes: 0
