Reputation: 21
A query plan in HAWQ can be split into several slices which can be run independently. How does HAWQ split query plan into slices?
Upvotes: 0
Views: 140
Reputation: 11
Note that #slices is equal to #motion nodes +1.
Motion node will be added into the plan when there is a need to redistribute data. Aggregation, join, sort functions etc. will all generate motion nodes.
For example, Sort can be partially done on segments, but the data on different segments are still unordered. We must redistributed the data to exact one segment in the upper slice to do the merge sort.
Upvotes: 0
Reputation: 81
First, let us clarify the meaning of slice. Slice is a sub-tree of the whole query plan and is a tree of operators, and these operators can run in a same node. Running in the same node means node running the slice doesn't need communicate with other nodes for data exchange.
So, if there are data exchange requirements, we split the plan, by adding motion node.
As @ztao said, there are three kinds of motion nodes.
Gather. One node needs to gather data from all the nodes. Usually used in the top slice which runs on Query dispatcher side. The dispatcher gathers all the results, do some operations, and return the result back to end-user.
Broadcast. Data in one node needs to be broadcasted to all the nodes. Usually used in join between small table and large table. We can broadcast the data of small tables to all nodes containing data of large table, so a hash-join can be executed next.
Redistribute. Data exist in multiple nodes following by some distribution policy, need redistribute the data according to a new policy. Usually used in join between two large tables, and the distribution key for the two table are not same. Need redistribute one table to ensure they both are collocated.
Upvotes: 2
Reputation: 71
Motion node(Gather/Broadcast/Redistribute) is added for different scenarios which splits query plan to different slices for parallel run purpose. For example, there is a nest loop join, whose outer child is a Table A SeqScan, and inner child is a Table B SeqScan. In optimizer code, it will decide to insert a motion node(would be broadcast or redistribute) in either outer child or inner child based on cost.
NestLoop
/ \
/ \
SeqScan A Broadcast motion
|
SeqScan B
Upvotes: 0