Fran
Fran

Reputation: 11

Redshift DISTSTYLE KEY. Deciding whats the best column to define as KEY

Well I recently got into this area of Redshift, trying to optimize disk usage and performance of my database, and having read lots of information on AWS about the topic, I still have some doubts. First of all, to my database structure. Per schema, I have 3 master tables, with 3 different IDs, these are now DISTSTLYE ALL tables, being small in size.

Each master table has different amounts of IDs,

the date table --> largest one (#1 most joined)

the store table --> medium one (#3 most joined)

the item table --> smallest one (#2 most joined)

Then I have my core table, which has needed combinations of these IDs to display additional information about them. Anyway, this table should be a DISTSTYLE KEY type, based on my knowledge. Well, which of the 3 IDs should I select to be my DIST KEY?

Whats the criteria for this decision? I understand that for joins I need to look at the Sort Key, well that has been understood and defined to the ID_date, because its the most joined table. So now, what about the distribution per node of this table?

I'm sorry if I'm rambling, I dont want to leave any information out. If I have, feel free to ask! Thanks for taking the time to read!

Upvotes: 1

Views: 4792

Answers (1)

John Rotenstein
John Rotenstein

Reputation: 269370

You'll find the best advice on Amazon Redshift best practices for designing tables. It goes into quite a bit of detail.

However, my rule of thumb is:

  • The DISTKEY should be the column most used in JOINs between tables
  • The SORTKEY should be the column most used in WHERE statements
  • Use DISTSTYLE ALL for small lookup tables

Upvotes: 6

Related Questions