Reputation: 11
Well I recently got into this area of Redshift, trying to optimize disk usage and performance of my database, and having read lots of information on AWS about the topic, I still have some doubts. First of all, to my database structure. Per schema, I have 3 master tables, with 3 different IDs, these are now DISTSTLYE ALL tables, being small in size.
Each master table has different amounts of IDs,
the date table --> largest one (#1 most joined)
the store table --> medium one (#3 most joined)
the item table --> smallest one (#2 most joined)
Then I have my core table, which has needed combinations of these IDs to display additional information about them. Anyway, this table should be a DISTSTYLE KEY type, based on my knowledge. Well, which of the 3 IDs should I select to be my DIST KEY?
Whats the criteria for this decision? I understand that for joins I need to look at the Sort Key, well that has been understood and defined to the ID_date, because its the most joined table. So now, what about the distribution per node of this table?
I'm sorry if I'm rambling, I dont want to leave any information out. If I have, feel free to ask! Thanks for taking the time to read!
Upvotes: 1
Views: 4792
Reputation: 269370
You'll find the best advice on Amazon Redshift best practices for designing tables. It goes into quite a bit of detail.
However, my rule of thumb is:
DISTKEY
should be the column most used in JOINs between tablesSORTKEY
should be the column most used in WHERE statementsDISTSTYLE ALL
for small lookup tablesUpvotes: 6