Reputation: 9
The main Redshift tables I work with (new role) hold clickstream data, and they are large (5-10MM new rows per table per day). I am testing different techniques to optimize queries that JOIN these tables with each other or with other tables, or that query just one of them. To do this, I use subqueries and temp tables (with proper DIST and SORT keys) tactically to lower computation time/cost. However, I am self-taught and want to improve.
If I am selecting two columns from one of the large tables (table1), where column 'id' needs to be unique and column 'date' needs to be the minimum value, and I am also joining with another table (table2): at a conceptual level, what is the best sequence/use/design of subqueries/temp tables/joins to execute this, and what factors influence this (e.g. whether it is only 2 days' worth of data versus 6 months)? Short of specific help, what are some guiding principles?
Neither id nor date is a sort/dist/primary/index key of the parent table (table1). I am looking for general guidance, not specific advice on this query, unless it proves illustrative (and I have read the Redshift documentation). Thanks
For example:
Table1: id, date, and other columns
Table2: id, tag
Version 1:
    SELECT id, MIN(date), tag
    FROM table1
    LEFT JOIN table2 USING (id)
    WHERE date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
    GROUP BY id, tag
Version 2:
    CREATE TEMP TABLE accounts SORTKEY(id) DISTKEY(id) AS (
        WITH prep AS (
            SELECT id, date
            FROM table1
            WHERE date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
        )
        SELECT id, MIN(date) AS min_date
        FROM prep
        GROUP BY id
    )
And then join table2 using id
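Something like the following (min_date being the alias for the aggregated column above):

    SELECT a.id, a.min_date, t2.tag
    FROM accounts a
    LEFT JOIN table2 t2 ON a.id = t2.id;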
Version 3:
    CREATE TEMP TABLE accounts SORTKEY(id) DISTKEY(id) AS (
        SELECT id, date
        FROM table1
        WHERE date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
    )
And then join table2 using id and conduct GROUP BY and MIN within the same query
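Something like:

    SELECT a.id, MIN(a.date) AS min_date, t2.tag
    FROM accounts a
    LEFT JOIN table2 t2 ON a.id = t2.id
    GROUP BY a.id, t2.tag;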
Upvotes: 0
Views: 486
Reputation: 11082
I'm going to guess that id is the DISTKEY for table2, since you want the table1 subquery result distributed by it. This is likely a good idea.
The best general advice is as follows:
I don't know your data or table structures, so this is guesswork. Your Version 1 may have an issue with the row count growing in the JOIN because a many-to-many relationship exists between the tables.
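If that is the case, one way to avoid the fan-out is to collapse table2 to one row per id before joining. A sketch only (the MAX(tag) rule is an assumption; use whatever rule fits your data):

    -- Collapse table2 to one row per id so the join cannot multiply rows
    WITH table2_one_row AS (
        SELECT id, MAX(tag) AS tag
        FROM table2
        GROUP BY id
    )
    SELECT t1.id, MIN(t1.date) AS min_date, t2.tag
    FROM table1 t1
    LEFT JOIN table2_one_row t2 ON t1.id = t2.id
    WHERE t1.date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
    GROUP BY t1.id, t2.tag;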
In V2 and V3 I think you meant CREATE, not WITH (CREATE TEMP TABLE ...).
V3 likely has the same issue as V1 (the many-to-many join), and reducing data earlier, with MIN(), is better.
V2 looks the best to me (limited info), but you don't need 2 levels of SELECT. It shouldn't matter, as the optimizer should smash these levels together. I would also remove the sort key from the temp table, as sorting is likely to cost more time than it saves.
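A sketch of what I mean, reusing your names:

    -- One level of SELECT, no SORTKEY; DISTKEY(id) kept for the join
    CREATE TEMP TABLE accounts DISTKEY(id) AS (
        SELECT id, MIN(date) AS min_date
        FROM table1
        WHERE date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
        GROUP BY id
    );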
The biggest factor I can see is whether table1 has been ANALYZEd or not. Invalid metadata could cost a lot of time reading unneeded data from disk.
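For example:

    -- Refresh table1's metadata so the planner can skip unneeded blocks
    ANALYZE table1;

    -- Check staleness; stats_off near 0 means the statistics are current
    SELECT "table", tbl_rows, stats_off
    FROM svv_table_info
    WHERE "table" = 'table1';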
Next, is there a date range that can be applied to table2 as well? Pruning unneeded data before the join could help. Again, table2 should be analyzed to help with this.
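For instance, if table2 carried its own date column (hypothetical here; your table2 as described only has id and tag), you could prune it in a subquery before the join:

    SELECT t1.id, t1.min_date, t2.tag
    FROM accounts t1
    LEFT JOIN (
        -- tag_date is a hypothetical column; prune to the same window
        SELECT id, tag
        FROM table2
        WHERE tag_date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
    ) t2 ON t1.id = t2.id;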
Since you are joining on id (and grouping by id) but table1 is not distributed by id, data will need to move around the cluster to get like ids onto the same slice. If table1 could be distributed by id, this would save time as well (but other factors may prevent this).
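If it does turn out to be viable, Redshift can change the distribution key in place:

    -- Redistribute table1 by id so the join and GROUP BY stay slice-local
    ALTER TABLE table1 ALTER DISTKEY id;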
Check the query plan and see how much data each step is dealing with. If there are large datasets in a step, think about how to reduce the data needed at that step.
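For example (the aliases assume the temp-table approach above):

    -- Look for DS_BCAST_INNER / DS_DIST_* attributes (network movement)
    -- and for steps whose estimated row counts are much larger than expected
    EXPLAIN
    SELECT a.id, a.min_date, t2.tag
    FROM accounts a
    LEFT JOIN table2 t2 ON a.id = t2.id;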
Upvotes: 0