Combining two queries where one uses GROUP BY

Question

I have two tables. TABLE1 has columns:

pers_key
cost
visit

TABLE2 has columns:

pers_key
months

First, I create a temporary table:

CREATE TABLE temp_table as
SELECT pers_key,SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key;

Then, I create TABLE3:

CREATE TABLE TABLE3 as
SELECT A.pers_key,
B.sum_cost/A.months AS ind1,
B.visit_count/A.months AS ind2
FROM TABLE2 AS A, temp_table AS B
WHERE A.pers_key = B.pers_key

I'm wondering if there is better way to achieve the same result here. Would it be possible to do this in one query without creating temp_table altogether? Perhaps something like:

CREATE TABLE TABLE3 as
SELECT A.pers_key,
(SUM(B.cost)over (partition by B.pers_key))/A.months AS ind1,
(COUNT(B.visit)over (partition by B.pers_key))/A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key

Or is the temporary table required to achieve the desired result set?

Gordon Linoff · Accepted Answer

How about just using a subquery?

SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
     (SELECT pers_key, SUM(cost) AS sum_cost,
             COUNT(DISTINCT visit) AS visit_count
      FROM TABLE1
      GROUP BY pers_key
     ) B
     ON A.pers_key = B.pers_key;

EDIT:

Your question is a bit complicated. This is definitely a reasonable approach. It may be faster to put the subquery in a table and build an index on the table for the join. However, a red flag is the count(distinct). In my experience with Hive, the following is faster than the above subquery:

     (SELECT pers_key, SUM(sum_cost) AS sum_cost,
             COUNT(visit) AS visit_count
      FROM (SELECT pers_key, visit, SUM(cost) as sum_cost
            FROM TABLE1
            GROUP BY pers_key, visit
           ) t
      GROUP BY pers_key
     ) B

It is a bit counter-intuitive (to me) that this version is faster. But, what happens is that the group by is that Hive readily parallelizes the group bys. On the other hand, the count(distinct) is processed serially. This sometimes occurs in other databases (I've seen similar behavior in Postgres with count(distinct). And another caveat: I did not set up the Hive system where I discovered this, so it might be some sort of configuration issue.

Combining two queries where one uses GROUP BY

Answers (1)

Related Questions