Reputation: 9
The main Redshift tables I work with (new role) hold clickstream data, and they are large (5-10MM new rows per table per day). I am testing different techniques to optimize queries that JOIN these tables with each other or with other tables, or that query just one of them. To do this, I use subqueries and temp tables (with proper DIST and SORT keys) tactically to lower computation time/cost. However, I am self-taught and want to improve.
If I am selecting two columns from one of the large tables (table1), where column 'id' needs to be unique and column 'date' needs to be the minimum value, and I am also joining with another table (table2): at a conceptual level, what is the best sequence/use/design of subqueries/temp tables/joins to execute this, and what factors influence this (e.g. whether it is only 2 days' worth of data versus 6 months)? Short of specific help, what are some guiding principles?
Neither id nor date is a sort/dist/primary/index key of the parent table (table1). I am looking for general guidance, not specific advice on this query, unless it proves illustrative (and I have read the Redshift documentation). Thanks
For example:
Table1: id, date, and other columns
Table2: id, tag
Version 1:
    SELECT id, MIN(date), tag
    FROM table1
    LEFT JOIN table2 USING (id)
    WHERE date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
    GROUP BY id, tag
Version 2:
    CREATE TEMP TABLE accounts SORTKEY(id) DISTKEY(id) AS (
        WITH prep AS (
            SELECT id, date
            FROM table1
            WHERE date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
        )
        SELECT id, MIN(date) AS min_date
        FROM prep
        GROUP BY id
    )
And then join table2 using id
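Something like the following (min_date being the alias for the aggregated column above):

    SELECT a.id, a.min_date, t2.tag
    FROM accounts a
    LEFT JOIN table2 t2 ON a.id = t2.id;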
Version 3:
    CREATE TEMP TABLE accounts SORTKEY(id) DISTKEY(id) AS (
        SELECT id, date
        FROM table1
        WHERE date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
    )
And then join table2 using id and conduct GROUP BY and MIN within the same query
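Something like:

    SELECT a.id, MIN(a.date) AS min_date, t2.tag
    FROM accounts a
    LEFT JOIN table2 t2 ON a.id = t2.id
    GROUP BY a.id, t2.tag;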
Upvotes: 0
Views: 486
Reputation: 11082
I'm going to guess that id is the DISTKEY for table2, since you want the table1 subquery result distributed by it. This is likely a good idea.
The best general advice is as follows:
I don't know your data or table structures, so this is guesswork. Your Version 1 may have an issue with the row count growing in the JOIN because a many-to-many relationship exists between the tables.
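If that is the case, one way to avoid the fan-out is to collapse table2 to one row per id before joining. A sketch only (the MAX(tag) rule is an assumption; use whatever rule fits your data):

    -- Collapse table2 to one row per id so the join cannot multiply rows
    WITH table2_one_row AS (
        SELECT id, MAX(tag) AS tag
        FROM table2
        GROUP BY id
    )
    SELECT t1.id, MIN(t1.date) AS min_date, t2.tag
    FROM table1 t1
    LEFT JOIN table2_one_row t2 ON t1.id = t2.id
    WHERE t1.date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
    GROUP BY t1.id, t2.tag;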
In V2 and V3 I think you meant CREATE, not WITH (CREATE TEMP TABLE ...).
V3 likely has the same issue as V1 (the many-to-many join), and reducing data earlier, with MIN(), is better.
V2 looks the best to me (limited info), but you don't need 2 levels of SELECT. It shouldn't matter, as the optimizer should smash these levels together. I would also remove the sort key from the temp table, as sorting is likely to cost more time than it saves.
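A sketch of what I mean, reusing your names:

    -- One level of SELECT, no SORTKEY; DISTKEY(id) kept for the join
    CREATE TEMP TABLE accounts DISTKEY(id) AS (
        SELECT id, MIN(date) AS min_date
        FROM table1
        WHERE date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
        GROUP BY id
    );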
The biggest factor I can see is whether table1 has been ANALYZEd or not. Invalid metadata could cost a lot of time reading unneeded data from disk.
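For example:

    -- Refresh table1's metadata so the planner can skip unneeded blocks
    ANALYZE table1;

    -- Check staleness; stats_off near 0 means the statistics are current
    SELECT "table", tbl_rows, stats_off
    FROM svv_table_info
    WHERE "table" = 'table1';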
Next, is there a date range that can be applied to table2 as well? Pruning unneeded data before the join could help. Again, table2 should be analyzed to help with this.
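For instance, if table2 carried its own date column (hypothetical here; your table2 as described only has id and tag), you could prune it in a subquery before the join:

    SELECT t1.id, t1.min_date, t2.tag
    FROM accounts t1
    LEFT JOIN (
        -- tag_date is a hypothetical column; prune to the same window
        SELECT id, tag
        FROM table2
        WHERE tag_date BETWEEN 'xxxx-xx-xx' AND 'xxxx-xx-xx'
    ) t2 ON t1.id = t2.id;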
Since you are joining on id (and grouping by id) but table1 is not distributed by id, data will need to move around the cluster to get like ids onto the same slice. If table1 could be distributed by id, this would save time as well (but other factors may prevent this).
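If it does turn out to be viable, Redshift can change the distribution key in place:

    -- Redistribute table1 by id so the join and GROUP BY stay slice-local
    ALTER TABLE table1 ALTER DISTKEY id;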
Check the query plan and see how much data each step is dealing with. If there are large datasets in a step, think about how to reduce the data needed at that step.
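For example (the aliases assume the temp-table approach above):

    -- Look for DS_BCAST_INNER / DS_DIST_* attributes (network movement)
    -- and for steps whose estimated row counts are much larger than expected
    EXPLAIN
    SELECT a.id, a.min_date, t2.tag
    FROM accounts a
    LEFT JOIN table2 t2 ON a.id = t2.id;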
Upvotes: 0