Reputation: 1447
Recently I was reading Hadoop: The Definitive Guide, in the part about copying data between two clusters using distcp, and I came across this comment: "When data size is very large, it becomes necessary to limit the number of maps in order to limit bandwidth and cluster utilization"
I don't understand why. I would think we should use as much of the bandwidth as possible to make the cluster more efficient. So why should we limit the number of maps?
Upvotes: 2
Views: 260
Reputation: 34184
Of course, having more mappers gives you more parallelism, but too many of them becomes a bottleneck. For example, if you have many more mappers than CPU slots available on your slaves, most of the mappers will sit in a wait state. Likewise, you may run out of memory and run into network congestion. It also takes more time to compute that many InputSplits and launch that many map tasks. So the number of mappers should be balanced: not too high, not too low. Under normal circumstances the framework works this out for you, so you don't have to worry about it. Sometimes, though, you may want to set it yourself to suit your requirements (with distcp, the -m option caps the number of maps); just keep the points above in mind.
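To put rough numbers on the "wait state" point, here is a minimal sketch in Python. It is a toy model, not Hadoop code: the function name `mapper_waves` and the slot count of 40 are made up for illustration, and it assumes every mapper needs exactly one slot and that slots are handed out in whole "waves".

```python
import math


def mapper_waves(num_mappers: int, map_slots: int) -> tuple[int, int]:
    """Toy model (hypothetical helper, not a Hadoop API).

    Returns (waves, initially_waiting):
      waves             - how many rounds are needed if each round runs
                          at most `map_slots` mappers at once
      initially_waiting - how many mappers cannot start in the first round
    """
    if num_mappers < 0 or map_slots <= 0:
        raise ValueError("num_mappers must be >= 0 and map_slots must be > 0")
    waves = math.ceil(num_mappers / map_slots)
    initially_waiting = max(0, num_mappers - map_slots)
    return waves, initially_waiting


if __name__ == "__main__":
    # Assumed cluster capacity: 40 map slots in total (illustrative value).
    for mappers in (20, 100, 1000):
        waves, waiting = mapper_waves(mappers, map_slots=40)
        print(f"{mappers} mappers -> {waves} wave(s), {waiting} waiting at start")
```

Once the mapper count goes past the number of slots, the extra mappers add queueing and startup overhead rather than parallelism, and in a distcp job every running map is also pulling data over the network between the two clusters.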
HTH
Upvotes: 1