Reputation: 125
I am starting to do some research on query tuning, and have been experimenting with using distkey and sortkey. From what I've read if I set the distkey to the joining column, the query planner will use a merge join instead of a hash join, which should be faster in Redshift. I was wondering if this also applies to temporary tables? Our production tables are actually views, so they do not have any keys already set. I'm not sure why we don't use the actual warehouse tables.
Upvotes: 4
Views: 9847
Reputation: 41
Yes, it can be done. Just put the distkey before the start of the table query
create temp table a distkey(column_name) as (select query .....)
Upvotes: 2
Reputation: 11032
Yes, keys can be set for temporary tables:
create temp table fred DISTKEY (1) as ...
this is easily done with column position - first column in this example. You can also set the distribution style on temp tables is you so desire. Doing this can force data to stay "on node" for intermediate results in very large and complex queries. Redshift does a good job make reasonable decisions on how to distribute intermediate results but isn't perfect and doesn't understand the nature of the data. I've done this with good results when large data images are in play.
As to you second point about using views instead of tables - In Redshift standard views are basically SQL macros that are flattened / optimized through by the Redshift query compiler. So use of views instead of tables is not bad in itself. Use of view, especially complex ones, can hide what is being done by the query and this can add unneeded and unexpected complexity to the query. The keys are set in the tables referenced by the views. (I'm assuming that the views are not referencing external/spectrum tables)
Lastly, you state you are looking to achieve Merge Join behavior to improve performance. While it is true that this is the fastest type of join, the time and work required to get merge joins to happen on temp tables will not be offset by this performance gain (experience). Redshift will only use a Merge join when it is sure that the data being joined will "zipper" together without issue. If it isn't completely sure this is the case it has to perform a Hash join which is a more general process. To get Redshift to do the Merge join you will need to sort and analyze your temp tables which will cost much more time than the savings you will get. It is far more important to have your joins be "DIST NONE" - no network distribution of the data - than moving from a hash join to a merge join.
Upvotes: 13