how the data will distribute when using dist key for a column in redshift

Question

I am new to redshift. I dont understand which column will be suitable for setting a distribution key to get improved query performance. How to find the best column? and how the data will be distributed across the nodes using dist key?

Red Boy · Accepted Answer

Its very wide question, its hard to provide your short answer. Anyways, let me try to summarize here, in Redshift there are two types of key, distkey and sortkey. distkey - table’s distkey is the column on which it’s distributed to each node. Rows with the same value in this column are guaranteed to be on the same node.

sortkey - table’s sortkey is the column by which it’s sorted within each node. It should be applied for columns you usually do order by.

Lets focus on distkey here.

Distribution key could be of two types, 'Even' or 'All'. Distribution keys are used for achieve following.

Distribute data evenly for parallel processing
Minimize data movement

'All' distribution style should be used for - have slowly changing data, reasonable size (i.e., few millions but not 100s of millions of rows), missing common distribution key for frequent joins. 'Even' distribution style should be used for- tables not frequently joined or aggregated and large tables without acceptable candidate keys.

Here are some good materials to read. https://www.slideshare.net/AmazonWebServices/deep-dive-on-amazon-redshift-64919704 https://www.youtube.com/watch?v=iuQgZDs-W7A https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html https://docs.aws.amazon.com/redshift/latest/dg/c_Distribution_examples.html

I hope this gives some way for you to move forward.

how the data will distribute when using dist key for a column in redshift

Answers (1)

Related Questions