Shellong

Reputation: 381

Hadoop streaming job hangs at the reduce-side merge stage

I wrote a Hadoop streaming job that uses Python code to transform the data, but the job runs into an error. When the input file is larger (e.g. 70 MB), it hangs at the reduce stage. When I shrink the input file (e.g. to 700 KB), it runs successfully. Below are some logs:

Reducer container logs:
2024-08-23 10:08:26,080 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : finalMerge called with 11 in-memory map-outputs and 0 on-disk map-outputs
2024-08-23 10:08:26,086 INFO [main] org.apache.hadoop.mapred.Merger : Merging 11 sorted segments
2024-08-23 10:08:26,087 INFO [main] org.apache.hadoop.mapred.Merger : Down to the last merge-pass, with 10 segments left of total size : 79232287 bytes
2024-08-23 10:08:26,466 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merged 11 segments, 79233409 bytes to disk to satisfy reduce memory limit
2024-08-23 10:08:26,469 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merging 1 files, 13421702 bytes from disk
2024-08-23 10:08:26,472 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merging 0 segments, 0 bytes from memory into reduce
2024-08-23 10:08:26,472 INFO [main] org.apache.hadoop.mapred.Merger : Merging 1 sorted segments
2024-08-23 10:08:26,480 INFO [main] org.apache.hadoop.mapred.Merger : Down to the last merge-pass, with 1 segments left of total size : 79233279 bytes

Application master logs:
24/08/05 10:08:51 INFO mapreduce.Job: map 100% reduce 100%
24/08/05 10:29:19 INFO mapreduce.Job: Task Id attempt_XXXXXX, Status: FAILED
AttemptID : attempt_XXXXXX Timed out after 1200 secs

I also checked the counters:

Map input records: 703,640 (this is correct)
Map output records: 685,583
Reduce input records: 685,583 (not correct)
Custom counter from code - ReduceInputRecords: 685,489 (this is counted in my code)

This shows that on the reduce side the true number of records received is 685,489, not the 685,583 counted by Hadoop. The code appears to hang on the line that reads sys.stdin. Does anyone know why?
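For reference, here is a minimal sketch of the kind of streaming reducer I mean (the transform logic is omitted; the counter group/name and the tab-separated input format are illustrative, only the sys.stdin loop and the stderr counter protocol matter):

    #!/usr/bin/env python
    # Minimal streaming reducer sketch: read key\tvalue lines from stdin
    # and report a custom counter via Hadoop streaming's stderr protocol.
    import sys

    def main():
        for line in sys.stdin:
            # "reporter:counter:<group>,<name>,<amount>" on stderr
            # increments a custom counter in the Hadoop UI.
            sys.stderr.write("reporter:counter:Code,ReduceInputRecords,1\n")
            key, _, value = line.rstrip("\n").partition("\t")
            # ... per-record processing would happen here ...
            sys.stdout.write(key + "\t" + value + "\n")

    if __name__ == "__main__":
        main()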

Upvotes: 1

Views: 27

Answers (1)

Shellong

Reputation: 381

I found the key reason: the reduce task is complicated and simply spends a very long time executing.

I added some status and counter reporting to my reducer task, and that revealed the cause: a for loop in the reducer has N*N complexity, so the reducer spends so long inside it without reporting progress that the attempt hits the 1200-second timeout.
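Here is a sketch of the kind of status reporting I added (the loop body is a placeholder for the real N*N work; the "reporter:status:" lines on stderr are standard Hadoop streaming and count as task progress):

    #!/usr/bin/env python
    # Sketch: report status from inside a long-running reducer loop so
    # the framework sees progress and does not kill the attempt.
    import sys

    records = [line.rstrip("\n") for line in sys.stdin]

    for i, rec in enumerate(records):
        for other in records:   # the expensive O(N*N) part (placeholder)
            pass
        if i % 1000 == 0:
            # Any "reporter:" line on stderr counts as progress, so the
            # attempt is not killed by mapreduce.task.timeout.
            sys.stderr.write("reporter:status:processed %d/%d records\n"
                             % (i, len(records)))

Raising mapreduce.task.timeout would also avoid the kill, but periodic status reports keep the default timeout and make the reducer's progress visible in the UI.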

Upvotes: 0
