Reputation: 724
I have several tables that contain the field customer_id.
There is not a lot of customer_ids but the underlying data is big (100s Gb per customer id).
All my queries always use this customer_id one way or another: join, aggregate or filter.
Consequently, this field seems to be the best candidate for the distributing key.
Question: If I set the same DISTRIBUTION KEY(customer_id) on all my tables, will redshift know that I want data for a specific customer on the same node for all these tables? If yes, how does it decide of this? simply by using the column name being similar over all these tables? This seems weird to me but I couldn't find anything on the topic.
Upvotes: 1
Views: 858
Reputation: 269410
Amazon Redshift will hash the value in the DISTKEY
column.
For example, if your cluster has a total of 4 slices, it will hash the key and use a modulo of 4. This will result in any value have value of 0-3. It then uses the value to distribute those rows on a specific slice.
If the number of nodes, and therefore slices, changes, then the hash will be recalculated for each row.
Always remember the basic rules:
DISTKEY
to the column that is most commonly JOINedSORTKEY
to the column(s) most commonly used in WHERE
Upvotes: 1