Reputation: 6342
I have a Hive table with 1000 files on HDFS, each about 128 MB (one HDFS block is 128 MB). When I run select count(1) on this table, it runs 1000 mappers in total, which is OK.
What makes things bad is that the Hive query will kick off as many of those mappers as cluster resources allow at the same time (up to 1000).
This is really bad and ugly, because it may occupy too many resources at once, leaving other applications nothing to use and forcing them to wait.
My question is: how do I control the maximum number of mappers running simultaneously?
That is, for example, out of 1000 mappers, at most 100 should run at any moment, so the query does not occupy too many resources at the same time. (Spark has this kind of control with the --num-executors and --executor-cores parameters.)
Upvotes: 2
Views: 2162
Reputation: 6342
As of Hadoop 2.7.0, MapReduce provides two configuration options to achieve this:
mapreduce.job.running.map.limit (default: 0, for no limit)
mapreduce.job.running.reduce.limit (default: 0, for no limit)
See MAPREDUCE-5583: Ability to limit running map and reduce tasks.
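Assuming Hive runs on the MapReduce execution engine, these properties can be set per session before issuing the query; a sketch (the table name my_table is a placeholder):

```sql
-- Cap the number of map/reduce tasks that run concurrently for jobs
-- launched from this session (requires Hadoop 2.7.0+; 0 = no limit)
SET mapreduce.job.running.map.limit=100;
SET mapreduce.job.running.reduce.limit=50;

SELECT count(1) FROM my_table;
```

With the limit set, all 1000 map tasks are still created, but the scheduler keeps at most 100 of them running at once; the rest wait in the pending queue instead of consuming cluster slots.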
Upvotes: 2