KarthiK
KarthiK

Reputation: 93

how the data will distribute when using dist key for a column in redshift

I am new to redshift. I dont understand which column will be suitable for setting a distribution key to get improved query performance. How to find the best column? and how the data will be distributed across the nodes using dist key?

Upvotes: 1

Views: 819

Answers (1)

Red Boy
Red Boy

Reputation: 5729

Its very wide question, its hard to provide your short answer. Anyways, let me try to summarize here, in Redshift there are two types of key, distkey and sortkey. distkey - table’s distkey is the column on which it’s distributed to each node. Rows with the same value in this column are guaranteed to be on the same node.

sortkey - table’s sortkey is the column by which it’s sorted within each node. It should be applied for columns you usually do order by.

Lets focus on distkey here.

Distribution key could be of two types, 'Even' or 'All'. Distribution keys are used for achieve following.

  • Distribute data evenly for parallel processing
  • Minimize data movement

'All' distribution style should be used for - have slowly changing data, reasonable size (i.e., few millions but not 100s of millions of rows), missing common distribution key for frequent joins. 'Even' distribution style should be used for- tables not frequently joined or aggregated and large tables without acceptable candidate keys.

Here are some good materials to read. https://www.slideshare.net/AmazonWebServices/deep-dive-on-amazon-redshift-64919704 https://www.youtube.com/watch?v=iuQgZDs-W7A https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html https://docs.aws.amazon.com/redshift/latest/dg/c_Distribution_examples.html

I hope this gives some way for you to move forward.

Upvotes: 1

Related Questions