Reputation: 125
I need to perform inner-join on tables with two common columns org_id and time_stamp on data in avro format in S3 queried through Athena
I have tried
SELECT year(from_iso8601_timestamp(em.time_stamp)) time_unit,
sum(em.column1) column1,
sum(spa.column2) column2,
sum(vir.column3) column3
FROM "schemaName".table1 em
JOIN "schemaName".table2 spa
ON year(from_iso8601_timestamp(em.time_stamp)) = year(from_iso8601_timestamp(spa.time_stamp))
AND em.org_id = spa.org_id
JOIN "schemaName".table3 vir
ON year(from_iso8601_timestamp(vir.time_stamp)) = year(from_iso8601_timestamp(spa.time_stamp))
AND vir.org_id = spa.org_id
WHERE em.org_id = 'org_id_test'
AND (from_iso8601_timestamp(em.time_stamp)) <= (cast(from_iso8601_timestamp('2019-11-22T23:59:31') AS timestamp))
AND (from_iso8601_timestamp(em.time_stamp)) >= (cast(from_iso8601_timestamp('2019-11-22T23:59:31') AS timestamp) - interval '10' year)
GROUP BY em.org_id, year(from_iso8601_timestamp(em.time_stamp))
ORDER BY time_unit DESC limit 11
But what I am getting is kind of looking as cross-join
results
time_unit |column1 |column2 |column3
1 2019 |48384 |299040 |712
while if I aggregate on each table separately with same where conditions, then values appear as
table1
column1
504
table2
column2
280
table3
column3
5
can somebody help me figure out what I am doing wrong and right way to achieve it?
Upvotes: 0
Views: 54
Reputation: 222442
If I followed you correctly, what is happening is that, since there are multiple records matching the conditions in each join, you end up the same record being counted multiple time when you aggregate.
A typical way around this is to aggregate in subqueries, and then join.
Something like this might be what you are looking for:
select
em.time_unit,
em.column1,
spa.column2,
vir.column3
from (
select
org_id,
year(from_iso8601_timestamp(time_stamp)) time_unit,
sum(column1) column1
from "schemaname".table1
group by org_id, year(from_iso8601_timestamp(time_stamp))
) em
join (
select
org_id,
year(from_iso8601_timestamp(time_stamp)) time_unit,
sum(column2) column2
from "schemaname".table2
group by org_id, year(from_iso8601_timestamp(time_stamp))
) spa on spa.time_unit = em.time_unit and spa.org_id = em.org_id
join (
select
org_id,
year(from_iso8601_timestamp(time_stamp)) time_unit,
sum(column3) column3
from "schemaname".table3
group by org_id, year(from_iso8601_timestamp(time_stamp))
) vir on vir.time_unit = em.time_unit and vir.org_id = em.org_id
where
em.org_id = 'org_id_test'
and em.time_unit between 2009 and 2019
order by em.time_unit desc
limit 11
Upvotes: 1