Breathe
Breathe

Reputation: 724

Does Redshift use the same key distribution when two tables have the same distribution keys?

I have several tables that contain the field customer_id.

There is not a lot of customer_ids but the underlying data is big (100s Gb per customer id).

All my queries always use this customer_id one way or another: join, aggregate or filter.

Consequently, this field seems to be the best candidate for the distributing key.

Question: If I set the same DISTRIBUTION KEY(customer_id) on all my tables, will redshift know that I want data for a specific customer on the same node for all these tables? If yes, how does it decide of this? simply by using the column name being similar over all these tables? This seems weird to me but I couldn't find anything on the topic.

Upvotes: 1

Views: 858

Answers (1)

John Rotenstein
John Rotenstein

Reputation: 269410

Amazon Redshift will hash the value in the DISTKEY column.

For example, if your cluster has a total of 4 slices, it will hash the key and use a modulo of 4. This will result in any value have value of 0-3. It then uses the value to distribute those rows on a specific slice.

If the number of nodes, and therefore slices, changes, then the hash will be recalculated for each row.

Always remember the basic rules:

  • Set DISTKEY to the column that is most commonly JOINed
  • Set SORTKEY to the column(s) most commonly used in WHERE

Upvotes: 1

Related Questions