Andrew O'Brien
Andrew O'Brien

Reputation: 1813

Redshift Distribution By Child Columns

My Situation

I have some tables in my redshift cluster that all break down into either an order_id, shipment_id, or shipment_item_id depending on how granular the table is. order_id is a 1 to many relationship on shipment_id and shipment_id is a 1 to many on shipemnt_item_id.

My Question

I distribute on order_id, so all shipment_id and shipment_item_id records should be on the same nodes across the tables since they are grouped by order_id. My question is, when I have to join on shipment_id or shipment_item_id then will redshift know that the records are on the same nodes, or will it still broadcast the tables since they aren't joined on order_id?

Example Tables

unified_order                                   shipment_details
+----------+-------------+------------------+   +-------------+-----------+--------------+
| order_id | shipment_id | shipment_item_id |   | shipment_id | ship_day  | ship_details |
+----------+-------------+------------------+   +-------------+-----------+--------------+
|        1 |           1 |                1 |   |           1 | 1/1/2017  | stuff        |
|        1 |           1 |                2 |   |           2 | 5/1/2017  | other stuff  |
|        1 |           1 |                3 |   |           3 | 6/14/2017 | more stuff   |
|        1 |           2 |                4 |   |           4 | 5/13/2017 | less stuff   |
|        1 |           2 |                5 |   |           5 | 6/19/2017 | that stuff   |
|        1 |           3 |                6 |   |           6 | 7/31/2017 | what stuff   |
|        2 |           4 |                7 |   |           7 | 2/5/2017  | things       |
|        2 |           4 |                8 |   +-------------+-----------+--------------+
|        3 |           5 |                9 |   
|        3 |           5 |               10 |   
|        4 |           6 |               11 |   
|        5 |           7 |               12 |   
|        5 |           7 |               13 |   
+----------+-------------+------------------+

Distribution

distribution_by_node
+------+----------+-------------+------------------+
| node | order_id | shipment_id | shipment_item_id |
+------+----------+-------------+------------------+
|    1 |        1 |           1 |                1 |
|    1 |        1 |           1 |                2 |
|    1 |        1 |           1 |                3 |
|    1 |        1 |           2 |                4 |
|    1 |        1 |           2 |                5 |
|    1 |        1 |           3 |                6 |
|    1 |        5 |           7 |               12 |
|    1 |        5 |           7 |               13 |
|    2 |        2 |           4 |                7 |
|    2 |        2 |           4 |                8 |
|    3 |        3 |           5 |                9 |
|    3 |        3 |           5 |               10 |
|    4 |        4 |           6 |               11 |
+------+----------+-------------+------------------+

Upvotes: 0

Views: 74

Answers (2)

AlexYes
AlexYes

Reputation: 4208

You can EXPLAIN your query to see how data will be distributed (or not) during the execution. In this doc you'll see how to read the query plan: Evaluating the Query Plan

Upvotes: 1

John Rotenstein
John Rotenstein

Reputation: 269390

The Amazon Redshift documentation does not go into detail how information is shared between nodes, but it is doubtful that it "broadcasts the tables".

Rather, information is probably sent between nodes based on need -- only the relevant columns would be shared, and possibly only sub-ranges of the data.

Rather than worrying too much about the internal implementation, you should test various DISTKEY and SORTKEY strategies against real queries to determine performance.

Follow the recommendations from Choose the Best Distribution Style to minimize the amount of data that needs to be sent between nodes and consult Amazon Redshift Best Practices for Designing Queries to improve queries.

Upvotes: 1

Related Questions