CodingInCircles
CodingInCircles

Reputation: 2807

Redshift - Efficient JOIN clause with OR

I have the need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:

SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;

This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.

EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.

Upvotes: 3

Views: 5281

Answers (2)

systemjack
systemjack

Reputation: 2985

How about doing two (left) joins? With the small lookup table performance shouldn't be too bad even.

SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null

Your original query only returns rows that match at least one of c or d in the lookup table. If that's not guaranteed you may need to add filters...for example rows in t1 where both c and d are null or have values not present in table2.

Don't really need the null checks in the joins, but might be slightly faster.

Upvotes: 0

Joe Harris
Joe Harris

Reputation: 14035

If you have a preferred column you can NVL() (aka COALESCE()) them and join on that.

SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 
  ON t1.c = NVL(t2.c,t2.d);

I'd also suggest that you should set the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.

[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]

Upvotes: 1

Related Questions