Mason
Mason

Reputation: 8846

Can Redshift use the results of a subquery to filter by sortkey?

I have a table in Redshift with a few billion rows which looks like this

CREATE TABLE channels AS (
 fact_key TEXT NOT NULL distkey
 job_key BIGINT
 channel_key TEXT NOT NULL
)
diststyle key
compound sortkey(job_key, channel_key);

When I query by job_key + channel_key my seq scan is properly restricted by the full sortkey if I use specific values for channel_key in my query.

EXPLAIN
SELECT * FROM channels scd
WHERE scd.job_key = 1 AND scd.channel_key IN ('1234', '1235', '1236', '1237')

XN Seq Scan on channels scd  (cost=0.00..3178474.92 rows=3428929 width=77)
  Filter: ((((channel_key)::text = '1234'::text) OR ((channel_key)::text = '1235'::text) OR ((channel_key)::text = '1236'::text) OR ((channel_key)::text = '1237'::text)) AND (job_key = 1))

However if I query against channel_key by using IN + a subquery Redshift does not use the sortkey.

EXPLAIN
SELECT * FROM channels scd
WHERE scd.job_key = 1 AND scd.channel_key IN (select distinct channel_key from other_channel_list where job_key = 14 order by 1)

XN Hash IN Join DS_DIST_ALL_NONE  (cost=3.75..3540640.36 rows=899781 width=77)
  Hash Cond: (("outer".channel_key)::text = ("inner".channel_key)::text)
  ->  XN Seq Scan on channels scd  (cost=0.00..1765819.40 rows=141265552 width=77)
        Filter: (job_key = 1)
  ->  XN Hash  (cost=3.75..3.75 rows=1 width=402)
        ->  XN Subquery Scan "IN_subquery"  (cost=0.00..3.75 rows=1 width=402)
              ->  XN Unique  (cost=0.00..3.74 rows=1 width=29)
                    ->  XN Seq Scan on other_channel_list  (cost=0.00..3.74 rows=1 width=29)
                          Filter: (job_key = 14)

Is it possible to get this to work? My ultimate goal is to turn this into a view so pre-defining my list of channel_keys won't work.

Edit to provide more context:

This is part of a larger query and the results of this get hash joined to some other data. If I hard-code the channel_keys then the input to the hash join is ~2 million rows. If I use the IN condition with the subquery (nothing else changes) then the input to the hash join is 400 million rows. The total query time goes from ~40 seconds to 15+ minutes.

Upvotes: 0

Views: 742

Answers (1)

systemjack
systemjack

Reputation: 2985

Does this give you a better plan than the subquery version?

with other_channels as (
    select distinct channel_key from other_channel_list where job_key = 14 order by 1
)
SELECT *
FROM channels scd
JOIN other_channels ocd on scd.channel_key = ocd.channel_key
WHERE scd.job_key = 1

Upvotes: 0

Related Questions