Reputation: 271
We are working on a collection of hive queries for our business users and noticed that this specific query takes a lot of time:
SELECT t.country, t.site, t.year, a.name, COUNT(t.*)
from
(SELECT DISTINCT country, site, year, month FROM signals) t,
(SELECT DISTINCT site, country, name from master_data) a
WHERE t.site = a.site and t.country = a.country
GROUP BY t.country, t.site, t.year, a.name;
Each subselect by itself takes about 25 seconds. The query without grabbing the name takes 2 minutes, but as soon as the join comes in time is exploding.
Do you have an idea why the execution time increases that rapidly?
P.S. t
returns 90 entries and a
returns 263
Upvotes: 0
Views: 6356
Reputation: 1185
I suggest that you use EXPLAIN
or EXPLAIN EXTENDED
(more details see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain) to see how the query is being actually executed. Studying the execution plan can reveal the reason.
By default Hive uses Map Reduce as execution engine. You might try the execution engine to tez which is much more efficient when it comes to complex queries. Add the following line before the SELECT
statement.
set hive.execution.engine=tez;
Looking at your query. Without name
the right sub-select in the join (named a
) will return much less records when name
is not included. Most likely in a magnitude of 10.
Try to use the JOIN clause with specifying the join predicates in the join. The execution plan is different and much more optimal:
SELECT t.country, t.site, t.year, a.name, COUNT(t.*)
from
(SELECT DISTINCT country, site, year, month FROM signals) t,
INNER JOIN (SELECT DISTINCT country, site, name from master_data) a
ON t.site = a.site and t.country = a.country
GROUP BY t.country, t.site, t.year, a.name;
Try also the version without sub-selects:
SELECT t.country, t.site, t.year, a.name, COUNT(t.*)
FROM signals t,
INNER JOIN master_data a
ON t.site = a.site and t.country = a.country
GROUP BY t.country, t.site, t.year, a.name;
Upvotes: 3