Reputation: 553
I have two tables. TABLE1 has columns:
pers_key
cost
visit
TABLE2 has columns:
pers_key
months
First, I create a temporary table:
CREATE TABLE temp_table as
SELECT pers_key,SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key;
Then, I create TABLE3:
CREATE TABLE TABLE3 as
SELECT A.pers_key,
B.sum_cost/A.months AS ind1,
B.visit_count/A.months AS ind2
FROM TABLE2 AS A, temp_table AS B
WHERE A.pers_key = B.pers_key
I'm wondering if there is better way to achieve the same result here. Would it be possible to do this in one query without creating temp_table altogether? Perhaps something like:
CREATE TABLE TABLE3 as
SELECT A.pers_key,
(SUM(B.cost)over (partition by B.pers_key))/A.months AS ind1,
(COUNT(B.visit)over (partition by B.pers_key))/A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key
Or is the temporary table required to achieve the desired result set?
Upvotes: 0
Views: 74
Reputation: 1271003
How about just using a subquery?
SELECT A.pers_key,
B.sum_cost / A.months AS ind1,
B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
(SELECT pers_key, SUM(cost) AS sum_cost,
COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key
) B
ON A.pers_key = B.pers_key;
EDIT:
Your question is a bit complicated. This is definitely a reasonable approach. It may be faster to put the subquery in a table and build an index on the table for the join. However, a red flag is the count(distinct)
. In my experience with Hive, the following is faster than the above subquery:
(SELECT pers_key, SUM(sum_cost) AS sum_cost,
COUNT(visit) AS visit_count
FROM (SELECT pers_key, visit, SUM(cost) as sum_cost
FROM TABLE1
GROUP BY pers_key, visit
) t
GROUP BY pers_key
) B
It is a bit counter-intuitive (to me) that this version is faster. But, what happens is that the group by
is that Hive readily parallelizes the group by
s. On the other hand, the count(distinct)
is processed serially. This sometimes occurs in other databases (I've seen similar behavior in Postgres with count(distinct)
. And another caveat: I did not set up the Hive system where I discovered this, so it might be some sort of configuration issue.
Upvotes: 2