Reputation: 93
I am new to redshift. I dont understand which column will be suitable for setting a distribution key to get improved query performance. How to find the best column? and how the data will be distributed across the nodes using dist key?
Upvotes: 1
Views: 819
Reputation: 5729
Its very wide question, its hard to provide your short answer. Anyways, let me try to summarize here, in Redshift
there are two types of key, distkey
and sortkey
.
distkey - table’s distkey is the column on which it’s distributed to each node. Rows with the same value in this column are guaranteed to be on the same node.
sortkey - table’s sortkey is the column by which it’s sorted within each node. It should be applied for columns you usually do order by
.
Lets focus on distkey
here.
Distribution key could be of two types, 'Even' or 'All'. Distribution keys are used for achieve following.
'All' distribution style should be used for - have slowly changing data, reasonable size (i.e., few millions but not 100s of millions of rows), missing common distribution key for frequent joins. 'Even' distribution style should be used for- tables not frequently joined or aggregated and large tables without acceptable candidate keys.
Here are some good materials to read. https://www.slideshare.net/AmazonWebServices/deep-dive-on-amazon-redshift-64919704 https://www.youtube.com/watch?v=iuQgZDs-W7A https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html https://docs.aws.amazon.com/redshift/latest/dg/c_Distribution_examples.html
I hope this gives some way for you to move forward.
Upvotes: 1