Does Redshift distribute by DISTKEY sequentially?

Question

I have a Redshift table with page hits, like so

CREATE TABLE hits
(
  user_id INT,
  ts TIMESTAMP,
  page VARCHAR(255)
)
SORTKEY(user_id, ts)
DISTKEY(user_id);

Since I'll be running a bunch of window functions over user_id, I thought it would be a good idea to distribute the table by user_id so nodes don't have to exchange data on users before being able to execute the query.

But the users are only ever active for some time and are numbered sequentially. user_id and time are therefore correlated so whenever I run a query that subsets by time (ts) this will lead to skew if Redshift also distributes by user_id sequentially. This would be less of a problem if it distributed by the DISTKEY randomly. My question is: does it?

(I'm new to Redshift so all of this may just be a total misunderstanding of how things work in general. In that case, apologies in advance!)

John Rotenstein · Accepted Answer

Amazon Redshift uses a hash of the DISTRIBUTION KEY (DISTKEY) to distribute data records amongst nodes.

Thus, records will be distributed differently on a 3-node cluster than a 4-node cluster.

If you are seeking evenly-distributed data, use the EVEN distribution method, which simply spreads records evenly between nodes. (However, this is unlikely to be optimal for your use-case.)

See documentation:

Does Redshift distribute by DISTKEY sequentially?

Answers (1)

Related Questions