akash sharma
akash sharma

Reputation: 461

How to optimise a query on a large dataset?

My original Query -

CREATE TABLE admin.FctPrfitAmt_rpt AS 
SELECT rcn.* FROM 
(SELECT t1.* FROM (SELECT * FROM admin.FctPrfitAmt t2 WHERE t2.scenario_id NOT IN(SELECT DISTINCT t3.scenario_id FROM admin.FctPrfitAmt_incr t3)
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr) t1) rcn;

The issue is that currently this query takes lot of time as the number of records involved are many.

Is there a way to tune this query?

I tried this approach but it does not work -

CREATE TABLE admin.FctPrfitAmt_rpt AS
SELECT * FROM admin.FctPrfitAmt t2 
WHERE t2.scenario_id NOT exists (SELECT 1 from  admin.FctPrfitAmt_incr t3 where t2.scenario_id = t3.scenario_id)
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr 

Error - looks like "not exists" is not supported in my version of Hive, so for my approach, I got the below error:

Error while compiling statement: FAILED: ParseException line 3:25 cannot recognize input near 'NOT' 'exists' '(' in expression specification

Upvotes: 3

Views: 167

Answers (3)

David דודו Markovitz
David דודו Markovitz

Reputation: 44941

select  *

from   (select  *
               ,max(tab) over (partition by scenario_id) as max_tab

        from    (           select *,1 as tab from master.FctPrfitAmt
                union all   select *,2 as tab from master.FctPrfitAmt_incr
                ) t
        ) t

where   tab     = 2
     or max_tab = 1
;

If all your data is consist of primitive types (no arrays, maps etc.),
you can use the following query:

select  inline(array(original_rowset))

from   (select  original_rowset
               ,tab
               ,max(tab) over (partition by scenario_id) as max_tab

        from    (           select struct(*) as original_rowset,scenario_id,1 as tab from FctPrfitAmt
                union all   select struct(*) as original_rowset,scenario_id,2 as tab from FctPrfitAmt_incr
                ) t
        ) t

where   tab     = 2
     or max_tab = 1

Upvotes: 0

David דודו Markovitz
David דודו Markovitz

Reputation: 44941

  1. Your syntax is wrong. NOT EXISTS should not be preceded by t2.scenario_id
  2. As we can see, scenario_id is skewed on both tables, wich creates a huge product on joins.

select  * 
from    admin.FctPrfitAmt   pa
where   not exists 
        (
            select  null

            from   (select  distinct 
                            pfa.scenario_id 

                    from    admin.FctPrfitAmt_incr  pfa
                    ) pfa

            where   pfa.scenario_id = 
                    pa.scenario_id
        )

union all

select  * 
from    admin.FctPrfitAmt_incr 

Upvotes: 2

Paul Back
Paul Back

Reputation: 1319

You're better off left joining the 2 tables in the 'select in' portion and filtering out rows that aren't null on the join key.

CREATE TABLE admin.FctPrfitAmt_rpt AS 
    SELECT rcn.* FROM 
    (
        SELECT t1.* 
        FROM admin.FctPrfitAmt t1 
        LEFT JOIN admin.FctPrfitAmt_incr t2
        ON t1.scenario_id = t2.scenario_id
        WHERE t2.scenario_id IS NULL
        UNION ALL
        SELECT * FROM admin.FctPrfitAmt_incr
    ) rcn
;

Upvotes: 2

Related Questions