Reputation: 135
I am running a simple query in Redshift where the source table has 26 million records and the other table has 120k records. One issue is that I don't have a direct relation between the two tables, so I am joining on a foreign key that is common to both tables. When I add a limit the query completes within seconds, but without a limit it runs forever.
`select a.person_name, b.city from persons a left outer join address b on a.zip_code= b.zip_code`
```
QUERY PLAN
XN Hash Left Join DS_BCAST_INNER  (cost=1559.11..39246081357.77 rows=224197624453 width=15)
  Hash Cond: ("outer".zip_code = "inner".zip_code)
  ->  XN Seq Scan on persons a  (cost=0.00..269134.10 rows=26913410 width=16)
  ->  XN Hash  (cost=1247.29..1247.29 rows=124729 width=15)
        ->  XN Seq Scan on address b  (cost=0.00..1247.29 rows=124729 width=15)
```
Any help would be appreciated.
Upvotes: 1
Views: 440
Reputation: 1269593
You probably have a bunch of zip codes with duplicate values in both tables. You can calculate the size of the intermediate result of the JOIN using:
```
select sum(p.cnt * a.cnt)
from (select zip_code, count(*) as cnt
      from persons
      group by zip_code
     ) p join
     (select zip_code, count(*) as cnt
      from address
      group by zip_code
     ) a
     on p.zip_code = a.zip_code;
```
You may find that this is actually generating billions and billions of rows -- if not more. In fact, your query plan already estimates `rows=224197624453`, roughly 224 billion rows in the join output. That is why the query is taking so long. You would need appropriate join conditions to fix the problem.
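If each person only needs one matching address row, one workaround is to collapse `address` to a single row per `zip_code` before joining, so the left join can produce at most one match per person. This is only a sketch against your column names -- the `min(city)` choice is arbitrary and assumes any one city per zip code is acceptable:

```
-- Pre-aggregate address to one row per zip_code, then join.
-- min(city) is an arbitrary tie-breaker among duplicate zips.
select a.person_name, b.city
from persons a
left outer join (
    select zip_code, min(city) as city
    from address
    group by zip_code
) b
on a.zip_code = b.zip_code;
```

This keeps the output at 26 million rows (one per person) instead of multiplying duplicate zip codes from both sides.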
Upvotes: 2