Rock
Rock

Reputation: 2977

Distributed by multiple columns

Is there a benefit of using multiple columns on distribution when creating a table? For instance:

CREATE TABLE data_facts (
    data_id int primary key,
    channel_id smallint,
    chart_id smallint,
    demo_id smallint,
    value numeric)
DISTRIBUTED BY (
    channel_id,
    chart_id,
    demo_id)

as there will be chance I need join data_facts with three different tables channel, chart and demo using channel_id, chart_id and demo_id respectively.

Specifically,

  1. Should I always add distribution and include all id(s) that I'm using for joining in terms of efficiency?
  2. If so, does the order of these id(s) matter?
  3. How does this work on an architecture level? (optional)

Thanks!

Upvotes: 4

Views: 9203

Answers (2)

Nick Galemmo
Nick Galemmo

Reputation: 21

No. Multiple hash keys do not provide benefits except when you are doing a hash distribution AND a single key does not provide a reasonably even distribution.

Co-located joins will occur under the following conditions:

  • It is an equijoin (key = key)
  • All distribution columns are used in the join.

If both conditions are not met, one of the tables will be broadcast across all computing nodes, resulting in performance problems and occasional query failure.

The practicality of using distribution has a lot to do with your modeling technique. It is more effective in star schema than 3NF structures. In star schemas it is common to distribute facts on the key of its largest dimension table. Colocation will occur if you join to that dimension. This gives you the greatest benefit since this would be the most time consuming join. Typically, the smaller dimension tables (in the query) wind up being broadcast and held in memory for the join.

Upvotes: 2

aleroot
aleroot

Reputation: 72646

It depends on how much you want to shard the database, and on how less records you want to distribute in each partition, I mean if you add more than one column in the distribution you will fragment a lot more the data into more partitions.

It also depends if you shard by modulo or hash ...

However, in my opinion, if you have a multiple columns primary key and you want to shard by this primary key could have a sense distributing by multiple columns(with all the columns in the primary key) otherwise you should shard by a single column that in most cases is enough .

Upvotes: 2

Related Questions