Reputation: 10203
I have a Redshift table with page hits, like so
CREATE TABLE hits
(
user_id INT,
ts TIMESTAMP,
page VARCHAR(255)
)
SORTKEY(user_id, ts)
DISTKEY(user_id);
Since I'll be running a bunch of window functions over user_id
, I thought it would be a good idea to distribute the table by user_id
so nodes don't have to exchange data on users before being able to execute the query.
But the users are only ever active for some time and are numbered sequentially. user_id and time are therefore correlated so whenever I run a query that subsets by time (ts
) this will lead to skew if Redshift also distributes by user_id
sequentially. This would be less of a problem if it distributed by the DISTKEY
randomly. My question is: does it?
(I'm new to Redshift so all of this may just be a total misunderstanding of how things work in general. In that case, apologies in advance!)
Upvotes: 4
Views: 1459
Reputation: 269410
Amazon Redshift uses a hash of the DISTRIBUTION KEY (DISTKEY) to distribute data records amongst nodes.
Thus, records will be distributed differently on a 3-node cluster than a 4-node cluster.
If you are seeking evenly-distributed data, use the EVEN
distribution method, which simply spreads records evenly between nodes. (However, this is unlikely to be optimal for your use-case.)
See documentation:
Upvotes: 3