Reputation: 186
I have 2 tables in AWS redshift. The details are as below
a) impressions (to count the number of impressions of a particular ad)
b) clicks (to count the number of clicks of a particular ad).
I have a single DC1 Large cluster with 2 slices.
I am trying to run the below query
select impressions.offer_id, count(imp_cnt) from
bidsflyer.tblImpressionLog_Opt impressions
full join bidsflyer.tblTrackingLinkLog_Opt clicks
on impressions.offer_id=clicks.offer_id and date_trunc('week',
impressions.created_on)=date_trunc('week', clicks.created_on)
where impressions.created_on >= '2017-07-27 00:00:00'
group by 1
This query takes more then 8 mins to run. I think this is quite large considering the volume of data, which I feel is not huge.
The query plan looks like something below
XN HashAggregate (cost=2778257688268.43..2778257688268.60 rows=67 width=12)
-> XN Hash Left Join DS_DIST_NONE (cost=179619.84..2778170875920.65 rows=17362469555 width=12)
Hash Cond: (("outer".offer_id = "inner".offer_id) AND (date_trunc('week'::text, "outer".created_on) = date_trunc('week'::text, "inner".created_on)))
-> XN Seq Scan on tblimpressionlog_opt impressions (cost=0.00..724967.36 rows=57997389 width=20)
Filter: (created_on >= '2017-07-27 00:00:00'::timestamp without time zone)
-> XN Hash (cost=119746.56..119746.56 rows=11974656 width=12)
-> XN Seq Scan on tbltrackinglinklog_opt clicks (cost=0.00..119746.56 rows=11974656 width=12)
Can anyone provide me guidance of the correct usage of distribution key and sort keys.
How should I design my query?
Upvotes: 1
Views: 5770
Reputation: 4208
Table setup:
1) According to the plan, the most expensive operation is grouping by offer_id. This makes sense because you didn't sort or distribute your data by offer_id. Your tables are quite large, so you can recreate the table with an interleaved sort key by (offer_id,created_on)
(interleaved keys are supposed to give equal and order-independent weight to the included columns and are known to have positive effect on larger tables).
2) If you join by weeks you can materialize your week column (create a physical column and populate it with date_trunc
output). That might save you some computation effort to get these values dynamically during the join. However, this operation is cheap and if your table is already sorted by timestamp column Redshift might already scan the appropriate blocks only. Also, if each offer runs for a short period of time (meaning offer column has high cardinality and high correlation with time column) you can have a compound sort key by (offer_id
,week_created
) that will allow merge join that is faster, and aggregate will fun fast as well.
3) if you don't use ad_campaign
in other queries you can distribute both tables by offer_id
. Having join column in distribution key is a good practice, it's unlikely that your query will benefit from this since you have a single node and distribution style mostly affects multinode setups.
All recommendations are just the assumptions without knowing the exact nature of your data, they require running benchmarks (create table with the recommended configuration, copy data, vaccuum, analyze, run the same query at least 3 times and compare times with the original setup). I would appreciate if you do this and post results here.
RE the query itself, you can replace FULL JOIN
with JOIN
because you don't need it. FULL JOIN
should be used when you want to get not only the intersection of both tables but also impressions that don't have related clicks and vice versa. Which doesn't seem the case because you are filtering by impressions.created_on
and group by impressions.offer_id
. So, all you need is just the intersection. Replacing FULL JOIN
by simple JOIN
also might affect query performance. If you want to see the offers that have zero clicks you can use LEFT JOIN
.
Upvotes: 2
Reputation: 1802
Merge join is faster than hash join, you should try to achieve merge join. You sort key looks okay, but is your data actually sorted? Redshift does not automatically keep table's rows sorted by sort key, there is no way for redshift to perform merge join on your table. Running a full vacuum on the table, redshift will start performing merge join.
select * from svv_table_info where table = 'impressions'
select * from svv_table_info where table = 'clicks'
Use above query to check the amount of unsorted data you have in your table.
Run a full vacuum on both your tables. Depending on the amount of unsorted data this might take a while and use a lot of your cluster resource.
VACUUM impressions to 100 percent
VACUUM clicks to 100 percent
If I’ve made a bad assumption please comment and I’ll refocus my answer.
Upvotes: 1