Reputation: 31
I am new to RedShift and just experimenting at this stage to help with table design.
We have a very simple table with about 6 million rows and 2 integer fields.
Both integer fields are in the sort key but the plan has a warning - "very selective query filter".
The STL_Alert_Event_Log entry is: 'Very selective query filter:ratio=rows(61)/rows_pre_user_filter(524170)=0.000116'
The query we are running is:
select count(*)
from LargeNumberofRowswithUniKey r
where r.benchmarkid = 291891 and universeid = 300901
Our Table DDL is:
CREATE TABLE public.LargeNumberofRowswithUniKey
(
benchmarkid INTEGER NOT NULL DISTKEY,
UniverseID INTEGER NOT NULL
)
SORTKEY
(
benchmarkid,UniverseID
);
We have also run the following commands on the table:
Vacuum full public.LargeNumberofRowswithUniKey;
Analyze public.LargeNumberofRowswithUniKey;
The screenshot of the plan is here: [Query Plan Image][1] My expectation was that the multiple sort key including Benchmark and Universe and the fact that both are part of the filter predicate would ensure that the design was optimal for the sample query. This does not seem to be the case, hence the red warning symbol in the attached image. Can anyone shed light on this?
Thanks
George
Update 2017/09/07 I have some more information that may help:
If I run a much simpler query which just filters on the first column of the sort key.
select r.benchmarkid
from LargeNumberofRowswithUniKey r
where r.benchmarkid = 291891
This results in 524,170 rows being scanned according to the actual query plan from the console. When I look at the blocks using STV_BLOCKLIST. The relevant blocks that might be required to satisfy my query are:
|slice|col|tbl |blocknum|num_values|minvalue|maxvalue|
| 1| 0|346457| 4| 262085| 291881| 383881|
| 3| 0|346457| 4| 262085| 291883| 344174|
| 0| 0|346457| 5| 262085| 291891| 344122|
So shouldn't there be 786,255 rows scanned (3 x 262,085) instead of 524,170 (2 x 262,085) as listed in the plan?
Upvotes: 3
Views: 3956
Reputation: 269171
Side-observation: If you are always selecting values by using both benchmarkid
and UniverseID
, you should probably use DISTKEY EVEN
.
The reason for this is that a benchmarkid DISTKEY
would distribute the data between slices based on benchmarkid
. All the values for a given benchmarkid
would be on the same slice. If your query always provides a benchmarkid
in the query, then the query only utilizes one slice.
On the other hand, if it used DISTKEY EVEN
, then every slice can participate in the query, making it more efficient (for queries with WHERE benchmarkid = xxx
).
A general rule of thumb is:
DISTKEY
for fields commonly used in JOIN or GROUP BYSORTKEY
for fields commonly used in WHEREUpvotes: 5
Reputation: 12756
The "very selective filter" warning is returned when the rows selected vs rows scanned ratio is less than 0.05 i.e. a relatively large number of rows are scanned compared to the number of rows actually returned. This can be caused by having a large number of unsorted rows in a table, which can be resolved by running a vacuum. However, as you're already doing that I think this is happening because your query is actually very selective (you're selecting a single combination of benchmarkid and universeid) and so you can probably ignore this warning.
Upvotes: 4