Tom

Reputation: 6342

How to control the maximum number of containers that one Hive query kicks off at the same time

I have a Hive table with 1000 files on HDFS, each about 128 MB (one HDFS block is 128 MB). When I run `select count(1)` on this table, it runs 1000 mappers in total, which is fine.

What makes things bad is that this Hive query will try to kick off as many mappers as possible at the same time, as long as cluster resources are available (up to 1000, of course).

This is really bad, because it may occupy too many resources at once, leaving other applications with nothing and forcing them to wait.

My question is: how do I control the maximum number of mappers that run simultaneously?

That is, for example, out of 1000 mappers, at most 100 should be running at any moment, so the query does not occupy too many resources at once. (Spark has such control via the `--num-executors` and `--executor-cores` parameters.)
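For reference, this is the kind of Spark-side cap the question alludes to (a sketch with illustrative values and a hypothetical job jar; it limits the job to 100 concurrent task slots):

```shell
# Cap concurrency at 100 executors x 1 core = 100 simultaneous tasks
spark-submit \
  --num-executors 100 \
  --executor-cores 1 \
  --class com.example.CountJob \
  count-job.jar
```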

Upvotes: 2

Views: 2162

Answers (1)

Tom
Tom

Reputation: 6342

As of Hadoop 2.7.0, MapReduce provides two configuration options for this:

  • `mapreduce.job.running.map.limit` (default: 0, meaning no limit)
  • `mapreduce.job.running.reduce.limit` (default: 0, meaning no limit)

MAPREDUCE-5583: Ability to limit running map and reduce tasks
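For a Hive query running on the MapReduce engine, these can be set per session before the query (a sketch; `my_table` is a placeholder, and the limits assume Hadoop 2.7.0 or later):

```sql
-- At most 100 map tasks and 20 reduce tasks running concurrently
SET mapreduce.job.running.map.limit=100;
SET mapreduce.job.running.reduce.limit=20;

SELECT count(1) FROM my_table;
```

With this, the job still launches 1000 mapper tasks in total, but the scheduler keeps no more than 100 of them running at any one time.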

Upvotes: 2
