Amazon Redshift doing Hash Join even when joined on column that is both Dist Key and Sort Key

Question

I have a fact table in Redshift having about 1.3 Billion rows with DISTribution key c1 and sort key c1, c2.

I need to join this table with itself with a join clause on c1 (i.e. c1 from 1st instance of table = c1 from 2nd instance of table).

As I see query plan of my query, Redshift appears to be doing a Hash Join with DS_DIST_NONE. Though DS_DIST_NONE is expected as I have both dist key and sort key on the column c1, but I expected Redshift to do a Merge Join instead of Hash Join (again because of the same reason).

I believe this is slowing down my query.

Can anyone please explain as to why Redshift may be doing a Hash Join instead of Merge Join (even though I have both DIST Key and SORT key on the joining column) and Redshift is doing DS_DIST_NONE for the query?

Vijay Kansal · Accepted Answer

It turned out that since we were not inserting data in our table in sorted order (as defined by Sort Key of that table) and Redshift does not automatically keep table's rows sorted by Sort Key, there was no way for Redshift to perform Merge Join on our table. After running a Full Vacuum on table, Redshift started performing Merge Join

Amazon Redshift doing Hash Join even when joined on column that is both Dist Key and Sort Key

Answers (1)

Related Questions