Piyush Tyagi
Piyush Tyagi

Reputation: 125

Mutli column and mutl table inner join

I need to perform inner-join on tables with two common columns org_id and time_stamp on data in avro format in S3 queried through Athena

I have tried

SELECT year(from_iso8601_timestamp(em.time_stamp)) time_unit,
         sum(em.column1) column1,
         sum(spa.column2) column2,
         sum(vir.column3) column3
FROM "schemaName".table1 em
JOIN "schemaName".table2 spa
    ON year(from_iso8601_timestamp(em.time_stamp)) = year(from_iso8601_timestamp(spa.time_stamp))
        AND em.org_id = spa.org_id
JOIN "schemaName".table3 vir
    ON year(from_iso8601_timestamp(vir.time_stamp)) = year(from_iso8601_timestamp(spa.time_stamp))
        AND vir.org_id = spa.org_id
WHERE em.org_id = 'org_id_test'
        AND (from_iso8601_timestamp(em.time_stamp)) <= (cast(from_iso8601_timestamp('2019-11-22T23:59:31') AS timestamp))
        AND (from_iso8601_timestamp(em.time_stamp)) >= (cast(from_iso8601_timestamp('2019-11-22T23:59:31') AS timestamp) - interval '10' year)
GROUP BY  em.org_id, year(from_iso8601_timestamp(em.time_stamp))
ORDER BY  time_unit DESC limit 11

But what I am getting is kind of looking as cross-join

results
    time_unit   |column1    |column2    |column3
1   2019        |48384      |299040     |712

while if I aggregate on each table separately with same where conditions, then values appear as

table1
column1

    504
table2
column2

    280
table3
column3

    5

can somebody help me figure out what I am doing wrong and right way to achieve it?

Upvotes: 0

Views: 54

Answers (1)

GMB
GMB

Reputation: 222442

If I followed you correctly, what is happening is that, since there are multiple records matching the conditions in each join, you end up the same record being counted multiple time when you aggregate.

A typical way around this is to aggregate in subqueries, and then join.

Something like this might be what you are looking for:

select 
    em.time_unit,
    em.column1,
    spa.column2,
    vir.column3
from (
    select 
        org_id,
        year(from_iso8601_timestamp(time_stamp)) time_unit,
        sum(column1) column1
    from "schemaname".table1
    group by org_id, year(from_iso8601_timestamp(time_stamp))
) em
join (
    select 
        org_id,
        year(from_iso8601_timestamp(time_stamp)) time_unit,
        sum(column2) column2
    from "schemaname".table2
    group by org_id, year(from_iso8601_timestamp(time_stamp))
) spa on spa.time_unit = em.time_unit and spa.org_id = em.org_id
join (
    select 
        org_id,
        year(from_iso8601_timestamp(time_stamp)) time_unit,
        sum(column3) column3
    from "schemaname".table3
    group by org_id, year(from_iso8601_timestamp(time_stamp))
) vir on vir.time_unit = em.time_unit and vir.org_id = em.org_id
where 
    em.org_id = 'org_id_test'
    and em.time_unit between 2009 and 2019
order by em.time_unit desc 
limit 11

Upvotes: 1

Related Questions