Rick Burgess

Reputation: 725

Unable to optimise Redshift query

I have built a system where data is loaded from S3 into Redshift every few minutes (from a Kinesis Firehose). I then pull data from that main table and split it into a table per customer.

The main table has a few hundred million rows.

Creating each subtable is done with a query like this:

create table {$table} as select * from {$source_table} where customer_id = '{$customer_id}' and time between '{$start}' and '{$end}'

I have keys defined as:

SORTKEY(customer_id, time)
DISTKEY(customer_id)
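
For reference, the full DDL looks roughly like this; only customer_id and time are real, the table name and remaining columns are stand-ins:

-- Sketch of the main table DDL; "events" and "payload" are
-- placeholders, only customer_id and time come from my schema.
CREATE TABLE events (
    customer_id VARCHAR(64),
    time        TIMESTAMP,
    payload     VARCHAR(MAX)  -- stands in for the other columns
)
DISTKEY(customer_id)
SORTKEY(customer_id, time);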

Everything I have read suggests this would be the optimal way to structure my tables and queries, but the performance is absolutely awful. Building the subtables takes over a minute even when only a few rows match.

Am I missing something, or do I just need to scale the cluster?

Upvotes: 0

Views: 205

Answers (1)

Joe Harris

Reputation: 14035

If you do not have a better key, you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
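
For example, a minimal sketch of rebuilding the main table that way (the "events" table name is a placeholder; substitute your actual table):

-- Rebuild with EVEN distribution but the same compound sort key.
CREATE TABLE events_even
DISTSTYLE EVEN
SORTKEY(customer_id, time)
AS SELECT * FROM events;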

Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key, you're forcing all the work to be done on just one slice.
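
You can check the skew directly in svv_table_info (a sketch; 'events' is a placeholder table name):

-- skew_rows is the ratio of rows on the fullest slice to the emptiest;
-- values far above 1.0 mean the rows are concentrated on few slices.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'events';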

To see this in action, look in the system tables. First, find an example query:

SELECT * 
FROM stl_query 
WHERE userid > 1 
ORDER BY starttime DESC 
LIMIT 10;

Then, look at the bytes per slice for each step of your query in svl_query_report:

SELECT * 
FROM svl_query_report 
WHERE query = <your query id> 
ORDER BY query,segment,step,slice;
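
To make any imbalance jump out, you can also aggregate per slice (same query id placeholder as above):

-- With a skewed distribution key, nearly all bytes land on one slice.
SELECT slice, SUM(bytes) AS total_bytes, SUM(rows) AS total_rows
FROM svl_query_report
WHERE query = <your query id>
GROUP BY slice
ORDER BY total_bytes DESC;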

For a very detailed guide on designing the best table structure, have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook".

Upvotes: 1
