keepkimi

Reputation: 393

Why doesn't increasing the number of instances increase Hive query speed?

I created a table with Hive on Amazon's Elastic MapReduce, imported data into it, and partitioned it. Now I run a query that counts the most frequent words in one of the table's fields.

I ran that query when I had 1 master and 2 core instances, and it took 180 seconds to complete. Then I reconfigured the cluster to 1 master and 10 core instances, and it still takes 180 seconds. Why isn't it faster?
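
The query is roughly of this shape (the table and field names below are placeholders, not my real schema):

-- placeholder names: my_table / text_field stand in for the real table and column
SELECT word, COUNT(*) AS cnt
FROM my_table
LATERAL VIEW explode(split(text_field, ' ')) words AS word
GROUP BY word
ORDER BY cnt DESC
LIMIT 100;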

The output is almost the same whether I run on 2 core instances or on 10:

Total MapReduce jobs = 2
Launching Job 1 out of 2

Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208251929_0003, Tracking URL = http://ip-10-120-250-34.ec2.internal:9100/jobdetails.jsp?jobid=job_201208251929_0003
Kill Command = /home/hadoop/bin/hadoop job  -Dmapred.job.tracker=10.120.250.34:9001 -kill     job_201208251929_0003
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2012-08-25 19:38:47,399 Stage-1 map = 0%,  reduce = 0%
2012-08-25 19:39:00,482 Stage-1 map = 3%,  reduce = 0%
2012-08-25 19:39:03,503 Stage-1 map = 5%,  reduce = 0%
2012-08-25 19:39:06,523 Stage-1 map = 10%,  reduce = 0%
2012-08-25 19:39:09,544 Stage-1 map = 18%,  reduce = 0%
2012-08-25 19:39:12,563 Stage-1 map = 24%,  reduce = 0%
2012-08-25 19:39:15,583 Stage-1 map = 35%,  reduce = 0%
2012-08-25 19:39:18,610 Stage-1 map = 45%,  reduce = 0%
2012-08-25 19:39:21,631 Stage-1 map = 53%,  reduce = 0%
2012-08-25 19:39:24,652 Stage-1 map = 67%,  reduce = 0%
2012-08-25 19:39:27,672 Stage-1 map = 75%,  reduce = 0%
2012-08-25 19:39:30,692 Stage-1 map = 89%,  reduce = 0%
2012-08-25 19:39:33,715 Stage-1 map = 94%,  reduce = 0%, Cumulative CPU 23.11 sec
2012-08-25 19:39:34,723 Stage-1 map = 94%,  reduce = 0%, Cumulative CPU 23.11 sec
2012-08-25 19:39:35,730 Stage-1 map = 94%,  reduce = 0%, Cumulative CPU 23.11 sec
2012-08-25 19:39:36,802 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:37,810 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:38,819 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:39,827 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:40,835 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:41,845 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:42,856 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:43,865 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:44,873 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:45,882 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:46,891 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:47,900 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:48,908 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:49,916 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:50,924 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 62.57 sec
2012-08-25 19:39:51,934 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 62.57 sec
2012-08-25 19:39:52,942 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 62.57 sec
2012-08-25 19:39:53,950 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 62.57 sec
2012-08-25 19:39:54,958 Stage-1 map = 100%,  reduce = 72%, Cumulative CPU 62.57 sec
2012-08-25 19:39:55,967 Stage-1 map = 100%,  reduce = 72%, Cumulative CPU 62.57 sec
2012-08-25 19:39:56,976 Stage-1 map = 100%,  reduce = 72%, Cumulative CPU 62.57 sec
2012-08-25 19:39:57,990 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 62.57 sec
2012-08-25 19:39:59,001 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 62.57 sec
2012-08-25 19:40:00,011 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 62.57 sec
2012-08-25 19:40:01,022 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 72.86 sec
2012-08-25 19:40:02,031 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 72.86 sec
2012-08-25 19:40:03,041 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 72.86 sec
2012-08-25 19:40:04,051 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 72.86 sec
2012-08-25 19:40:05,060 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 72.86 sec
2012-08-25 19:40:06,070 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 72.86 sec
2012-08-25 19:40:07,079 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 72.86 sec
MapReduce Total cumulative CPU time: 1 minutes 12 seconds 860 msec
Ended Job = job_201208251929_0003
Counters:
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208251929_0004, Tracking URL = http://ip-10-120-250-34.ec2.internal:9100/jobdetails.jsp?jobid=job_201208251929_0004
Kill Command = /home/hadoop/bin/hadoop job  -Dmapred.job.tracker=10.120.250.34:9001 -kill     job_201208251929_0004
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2012-08-25 19:40:30,147 Stage-2 map = 0%,  reduce = 0%
2012-08-25 19:40:43,241 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:44,254 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:45,262 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:46,272 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:47,282 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:48,290 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:49,298 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:50,306 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:51,315 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:52,323 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:53,331 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:54,339 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 7.48 sec
2012-08-25 19:40:55,347 Stage-2 map = 100%,  reduce = 33%, Cumulative CPU 7.48 sec
2012-08-25 19:40:56,357 Stage-2 map = 100%,  reduce = 33%, Cumulative CPU 7.48 sec
2012-08-25 19:40:57,365 Stage-2 map = 100%,  reduce = 33%, Cumulative CPU 7.48 sec
2012-08-25 19:40:58,374 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 10.85 sec
2012-08-25 19:40:59,384 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 10.85 sec
2012-08-25 19:41:00,393 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 10.85 sec
2012-08-25 19:41:01,407 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 10.85 sec
2012-08-25 19:41:02,420 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 10.85 sec
2012-08-25 19:41:03,431 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 10.85 sec
2012-08-25 19:41:04,443 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 10.85 sec
MapReduce Total cumulative CPU time: 10 seconds 850 msec
Ended Job = job_201208251929_0004
Counters:
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Accumulative CPU: 72.86 sec   HDFS Read: 4920 HDFS Write: 8371374 SUCCESS
Job 1: Map: 1  Reduce: 1   Accumulative CPU: 10.85 sec   HDFS Read: 8371850 HDFS Write: 456 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 23 seconds 710 msec

Upvotes: 0

Views: 1228

Answers (2)

VeLKerr

Reputation: 3157

I think you should increase the number of reducers your query runs on. You can do that with the following setting:

set mapred.reduce.tasks=n;

where n is the number of reducers.

Then use a DISTRIBUTE BY or CLUSTER BY clause (not to be confused with CLUSTERED BY) to spread the dataset as evenly as possible across the reducers. If you don't need sorting, prefer DISTRIBUTE BY, because

Cluster By is a short-cut for both Distribute By and Sort By.

Here is the link to the Hive manual.
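
A rough sketch of the idea (my_table and text_field are placeholder names, not your schema):

-- ask for 10 reducers instead of 1
set mapred.reduce.tasks=10;

-- spread the exploded words evenly across those reducers before aggregating
SELECT word, COUNT(*) AS cnt
FROM (
  SELECT word
  FROM my_table
  LATERAL VIEW explode(split(text_field, ' ')) words AS word
  DISTRIBUTE BY word
) w
GROUP BY word;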

Upvotes: 0

David Gruzman

Reputation: 8088

You have only one reducer, and it is doing most of the work. I think that is the reason.
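
For instance, you can either fix the reducer count or lower the per-reducer input threshold that Hive uses for its estimate (the values below are only illustrative):

-- either force a reducer count...
set mapred.reduce.tasks=8;

-- ...or lower the bytes-per-reducer threshold so Hive estimates more than one
-- reducer from the input size
set hive.exec.reducers.bytes.per.reducer=134217728;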

Upvotes: 1
