AWS Enthusiastic
AWS Enthusiastic

Reputation: 186

Optimal way of using joins in Redshift

I have 2 tables in AWS redshift. The details are as below

a) impressions (to count the number of impressions of a particular ad)

  1. Number of rows (170 million)
  2. distribution key(ad_campaign)
  3. sort key (created_on)

b) clicks (to count the number of clicks of a particular ad).

  1. Number of rows (80 million)
  2. distribution key(ad_campaign)
  3. sort key (created_on)

I have a single DC1 Large cluster with 2 slices.

I am trying to run the below query

select impressions.offer_id, count(imp_cnt) from 
  bidsflyer.tblImpressionLog_Opt impressions
full join bidsflyer.tblTrackingLinkLog_Opt clicks
  on impressions.offer_id=clicks.offer_id and date_trunc('week', 
impressions.created_on)=date_trunc('week', clicks.created_on)
  where impressions.created_on >= '2017-07-27 00:00:00'
  group by 1

This query takes more then 8 mins to run. I think this is quite large considering the volume of data, which I feel is not huge.

The query plan looks like something below

XN HashAggregate  (cost=2778257688268.43..2778257688268.60 rows=67 width=12)
      ->  XN Hash Left Join DS_DIST_NONE  (cost=179619.84..2778170875920.65 rows=17362469555 width=12)

    Hash Cond: (("outer".offer_id = "inner".offer_id) AND (date_trunc('week'::text, "outer".created_on) = date_trunc('week'::text, "inner".created_on)))

  ->  XN Seq Scan on tblimpressionlog_opt impressions  (cost=0.00..724967.36 rows=57997389 width=20)
          Filter: (created_on >= '2017-07-27 00:00:00'::timestamp without time zone)

  ->  XN Hash  (cost=119746.56..119746.56 rows=11974656 width=12)
          ->  XN Seq Scan on tbltrackinglinklog_opt clicks  (cost=0.00..119746.56 rows=11974656 width=12)

Can anyone provide me guidance of the correct usage of distribution key and sort keys.

How should I design my query?

Upvotes: 1

Views: 5770

Answers (2)

AlexYes
AlexYes

Reputation: 4208

Table setup:

1) According to the plan, the most expensive operation is grouping by offer_id. This makes sense because you didn't sort or distribute your data by offer_id. Your tables are quite large, so you can recreate the table with an interleaved sort key by (offer_id,created_on) (interleaved keys are supposed to give equal and order-independent weight to the included columns and are known to have positive effect on larger tables).

2) If you join by weeks you can materialize your week column (create a physical column and populate it with date_trunc output). That might save you some computation effort to get these values dynamically during the join. However, this operation is cheap and if your table is already sorted by timestamp column Redshift might already scan the appropriate blocks only. Also, if each offer runs for a short period of time (meaning offer column has high cardinality and high correlation with time column) you can have a compound sort key by (offer_id,week_created) that will allow merge join that is faster, and aggregate will fun fast as well.

3) if you don't use ad_campaign in other queries you can distribute both tables by offer_id. Having join column in distribution key is a good practice, it's unlikely that your query will benefit from this since you have a single node and distribution style mostly affects multinode setups.

All recommendations are just the assumptions without knowing the exact nature of your data, they require running benchmarks (create table with the recommended configuration, copy data, vaccuum, analyze, run the same query at least 3 times and compare times with the original setup). I would appreciate if you do this and post results here.

RE the query itself, you can replace FULL JOIN with JOIN because you don't need it. FULL JOIN should be used when you want to get not only the intersection of both tables but also impressions that don't have related clicks and vice versa. Which doesn't seem the case because you are filtering by impressions.created_on and group by impressions.offer_id. So, all you need is just the intersection. Replacing FULL JOIN by simple JOIN also might affect query performance. If you want to see the offers that have zero clicks you can use LEFT JOIN.

Upvotes: 2

Rahul Gupta
Rahul Gupta

Reputation: 1802

Merge join is faster than hash join, you should try to achieve merge join. You sort key looks okay, but is your data actually sorted? Redshift does not automatically keep table's rows sorted by sort key, there is no way for redshift to perform merge join on your table. Running a full vacuum on the table, redshift will start performing merge join.

select * from svv_table_info where table = 'impressions'
select * from svv_table_info where table = 'clicks'

Use above query to check the amount of unsorted data you have in your table.
Run a full vacuum on both your tables. Depending on the amount of unsorted data this might take a while and use a lot of your cluster resource.

VACUUM impressions to 100 percent
VACUUM clicks to 100 percent

If I’ve made a bad assumption please comment and I’ll refocus my answer.

Upvotes: 1

Related Questions