Rick Burgess

Reputation: 725

Unable to optimise Redshift query

I have built a system where data is loaded from S3 into Redshift every few minutes (from a Kinesis Firehose). I then pull data from that main table and split it into a table per customer.

The main table has a few hundred million rows.

Creating each subtable is done with a query like this:

create table {$table} as select * from {$source_table} where customer_id = '{$customer_id}' and time between '{$start}' and '{$end}'

I have keys defined as:

SORTKEY(customer_id, time)
DISTKEY(customer_id)
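
For reference, the full DDL looks roughly like this; only customer_id and time are real, the table name and remaining columns are stand-ins:

-- Sketch of the main table DDL; "events" and "payload" are
-- placeholders, only customer_id and time come from my schema.
CREATE TABLE events (
    customer_id VARCHAR(64),
    time        TIMESTAMP,
    payload     VARCHAR(MAX)  -- stands in for the other columns
)
DISTKEY(customer_id)
SORTKEY(customer_id, time);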

Everything I have read suggests this would be the optimal way to structure my tables and queries, but the performance is absolutely awful. Building the subtables takes over a minute even when only a few rows match.

Am I missing something, or do I just need to scale the cluster?

Upvotes: 0

Views: 205

Answers (1)

Joe Harris

Reputation: 14035

If you do not have a better key, you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
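
For example, a minimal sketch of rebuilding the main table that way (the "events" table name is a placeholder; substitute your actual table):

-- Rebuild with EVEN distribution but the same compound sort key.
CREATE TABLE events_even
DISTSTYLE EVEN
SORTKEY(customer_id, time)
AS SELECT * FROM events;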

Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key, you're forcing all the work to be done on just one slice.
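
You can check the skew directly in svv_table_info (a sketch; 'events' is a placeholder table name):

-- skew_rows is the ratio of rows on the fullest slice to the emptiest;
-- values far above 1.0 mean the rows are concentrated on few slices.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'events';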

To see this in action, look in the system tables. First, find an example query:

SELECT * 
FROM stl_query 
WHERE userid > 1 
ORDER BY starttime DESC 
LIMIT 10;

Then, look at the bytes per slice for each step of your query in svl_query_report:

SELECT * 
FROM svl_query_report 
WHERE query = <your query id> 
ORDER BY query,segment,step,slice;
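
To make any imbalance jump out, you can also aggregate per slice (same query id placeholder as above):

-- With a skewed distribution key, nearly all bytes land on one slice.
SELECT slice, SUM(bytes) AS total_bytes, SUM(rows) AS total_rows
FROM svl_query_report
WHERE query = <your query id>
GROUP BY slice
ORDER BY total_bytes DESC;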

For a very detailed guide on designing the best table structure, have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook".

Upvotes: 1
