user3252809
user3252809

Reputation: 136

Apache Hive different queries union, on the same table, with different where conditions in each query?

I have a Hive table named "sales" with below structure:

id,ptype,amount,time,date
1,a,12,2240,2013-12-25
1,a,4,1830,2013-12-25
1,b,2,1920,2013-12-25
1,b,3,2023,2013-12-25
2,a,5,1220,2013-12-25
2,a,1,1320,2013-12-25

Below is my queries for different variables variable:

Q1: select id,sum(amount) as s_amt from sales group by id;
Q2: select id, sum(amount) as s_a_amt from sales where ptype='a' group by id;
Q3: select id, sum(amount) as s_b_amt from sales where ptype='b' group by id;

As far what I learned in Hive we can apply "union all" option only when we have same column name or query schema. Below is the end result what i want to achieve using Hive query:

id,s_amt,s_a_amt,s_b_amt
1,21,16,5
2,6,6,0

Below is one query that i tried and it executed successfully. But it will be a very painful task when you have to design the same query for more than 300 variables. Is there any efficient approach for the same task considering we have more than 300 variables? Appreciate your comments!

    select t.id,max(t.s_amt) as s_amt,max(t.s_a_amt) as s_a_amt, max(t.s_b_amt) as s_b_amt
  from
     (select s1.id,sum(amount) as s_amt,0 as s_a_amt,0 as s_b_amt from sales s1 group by id union all
     select s2.id, 0 as s_amt, sum(amount) as s_a_amt, 0 as s_b_amt from sales s2 where ptype='a' group by id union all
     select s3.id, 0 as s_amt,0 as s_a_amt, sum(amount) as s_b_amt from sales s3 where ptype='b' group by id) t
group by t.id;

Upvotes: 3

Views: 1812

Answers (3)

Arnaud Delanoue
Arnaud Delanoue

Reputation: 26

The solution should be:

select id, sum(amount) s_amt, 
        SUM (CASE WHEN ptype='a' THEN amount 
                       ELSE 0
        END) sum_a_amt,
        SUM (CASE WHEN ptype='b' THEN amount 
                       ELSE 0
        END) sum_b_amt
from sales 
group by id;

Please try it and tell me if it works, I cannot test it right now...

Upvotes: 1

Brian from Chicago
Brian from Chicago

Reputation: 11

Hive has recently added GROUPING SETS as a new feature (https://issues.apache.org/jira/browse/HIVE-3471). It could be a lot easier (to write or read) than MQT. But not everyone knows about this feature and the use of CASE functions, as Arnaud has illustrated, is more commonly used in practice.

Upvotes: 1

Prithiviraj Damodaran
Prithiviraj Damodaran

Reputation: 73

Ideal Solution would be to have a

Materialised Query Table (MQT) as IBM refers.

Summary tables are special form of MQTs and thats exactly what you need. Quick definition - as the name suggests MQT is a simple summary table, materialized on the disk.

With MQT support all you have to do is the below

CREATE MATERIALISED QUERY TABLE MQTA AS (
select id, sum(amount) as s_a_amt from sales where ptype='a' group by id;
)
Data initially deferred 
Refresh deferred
Maintained by User

Data initially deferred says not insert summary records into the summary table. Refresh deferred says data in the table can be refreshed at any time using the REFRESH TABLE statement. Maintained by user says the Refersh of this table has to be taken care by the user - Maintained by System is another option in which system takes care of automatically updating the summary table when the base table sees inserts/deletes//updates.

You could directly query the MQT like a simple select query, all the heavy lifting of summarising records would have actually ran before and not when you query the MQT so it would much faster.

But AFAIK HIVE doesn’t support MQT or summary tables.

You now know the concept, you just have to simply simulate this.Create a summary table and insert the summary records (The REFRESH TABLE concept). You have to load summary values periodically by controlling with some kind of last load date fields so you will pickup only the records after last refresh.You can do this with scheduled jobs - Hive scripts.

INSERT INTO PTYPE_AMOUNT_MQT AS (
select *
  from
     (select s1.id,sum(amount) as s_amt,0 as s_a_amt,0 as s_b_amt from sales s1 where record_create_date > last_Refresh_date group by id union all
     select s2.id, 0 as s_amt, sum(amount) as s_a_amt, 0 as s_b_amt from sales s2 where ptype='a' and record_create_date > last_Refresh_date  group by id union all
     select s3.id, 0 as s_amt,0 as s_a_amt, sum(amount) as s_b_amt from sales s3 where ptype='b'  and record_create_date > last_Refresh_date group by id) 
)

It is always good to have audit fields like record_create_date and time.The last_Refresh_date is the last time your job ran

Upvotes: 2

Related Questions