john
john

Reputation: 35390

How should I set the distkey for a left join with conditionals in Redshift?

I have a query that looks like this:

select
   a.col1,
   a.col2,
   b.col3
from
   a 
   left join b on (a.id=b.id and b.attribute_id=3)
   left join c on (a.id=c.id and c.attribute_id=4)

Even setting the distkey to id gets me a DS_BCAST_INNER in the query plan and I end up with extraordinary query time for a mere 1 million rows.

Upvotes: 0

Views: 621

Answers (1)

Tony Gibbs
Tony Gibbs

Reputation: 2489

Setting the id to be the distribution key should co-locate the data and remove the need for the broadcast.

create table a (id int distkey, attribute_id int, col1 varchar(10), col2 varchar(10));
create table b (id int distkey, attribute_id int, col3 varchar(10));
create table c (id int distkey, attribute_id int);

You should see an explain plan something like this:

admin@dev=# explain select
       a.col1,
       a.col2,
       b.col3
    from
       a 
       left join b on (a.id=b.id and b.attribute_id=3)
       left join c on (a.id=c.id and c.attribute_id=4);
                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     XN Hash Left Join DS_DIST_NONE  (cost=0.09..0.23 rows=3 width=99)
       Hash Cond: ("outer".id = "inner".id)
       ->  XN Hash Left Join DS_DIST_NONE  (cost=0.05..0.14 rows=3 width=103)
             Hash Cond: ("outer".id = "inner".id)
             ->  XN Seq Scan on a  (cost=0.00..0.03 rows=3 width=70)
             ->  XN Hash  (cost=0.04..0.04 rows=3 width=37)
                   ->  XN Seq Scan on b  (cost=0.00..0.04 rows=3 width=37)
                         Filter: (attribute_id = 3)
       ->  XN Hash  (cost=0.04..0.04 rows=1 width=4)
             ->  XN Seq Scan on c  (cost=0.00..0.04 rows=1 width=4)
                   Filter: (attribute_id = 4)
    (11 rows)

    Time: 123.315 ms

If the tables contain 3 million rows or less and have a low frequency of writes it should be safe to use DIST STYLE ALL. If you do use DIST STYLE KEY, verify that distributing your tables does not cause row skew (check with the following query):

select "schema", "table", skew_rows from svv_table_info;

"skew_rows" is the ratio of data between the slice with the most and the least data. It should be close 1.00.

Upvotes: 1

Related Questions