Reputation: 3153
I have a query that inserts data into a table.
The process first creates the end table in the database:
CREATE TABLE IF NOT EXISTS {new_schema}.{new_table} (LIKE {schema}.{table} INCLUDING DEFAULTS);
Then, it inserts into the stg table using:
INSERT INTO {new_schema}.{new_table} (columns....) SELECT columns... FROM {table} and some other logic
This inserts 4,000 rows and takes around 2 minutes. The stg table is completely empty before the insert; the end table has around 450,000 rows.
Both the end and stg tables have DISTSTYLE EVEN and SORTKEY(date), where the sort key column is defined as date date encode az64.
I would like to know how I can improve that insert performance. I read in the AWS Redshift best practices that you can load data ordered by the sort key, so that you avoid a vacuum.
In my case, does that mean using the following statement?
SELECT columns... FROM {table} and some other logic ORDER BY date
Maybe, in my specific case, since I can have a lot of data in the source tables (the load into stg) but not as much in the stg table itself, I should drop the sortkey on the stg table, insert into it as defined above (without ORDER BY), and then insert from the stg table into the end table using ORDER BY, as in the sketch below. This would keep the ordering complexity out of the first big query and add it only to the one that handles 4k rows.
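For reference, a rough sketch of that two-step flow (placeholders as above; {stg_table}, {end_table}, and the column names are just illustrative):

-- Step 1: heavy query into the stg table, which has no sortkey
INSERT INTO {new_schema}.{stg_table} (col1, col2, date)
SELECT col1, col2, date FROM {schema}.{table} -- and some other logic
;
-- Step 2: small, ordered insert into the end table with SORTKEY(date)
INSERT INTO {new_schema}.{end_table} (col1, col2, date)
SELECT col1, col2, date FROM {new_schema}.{stg_table}
ORDER BY date;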
Upvotes: 0
Views: 579
Reputation: 11032
Sharing some performance statistics of the query would be helpful, but I think we can start with a few basic questions. You point to the insert as taking a lot of time; however, there's no data on the time it takes for the sourcing select to execute. My first suggestion is to look at the run time of this select. While it is possible that the insert is taking the time, I suspect this is where most of the time is spent.
What is the runtime of the select by itself? Since this query produces a lot of output, you don't want it to flow to your SQL workbench, as that would just measure the speed of your internet connection. If you have a sort clause on this query, you can just limit the number of rows produced; since the entire query needs to run to complete the sort, this time should be close enough. Otherwise you will want to set up a cursor and use that to capture the results (remember that the query doesn't run until the first fetch).
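A sketch of the cursor approach (the cursor name is arbitrary, the SELECT is the placeholder from your question, and Redshift cursors must be declared inside a transaction):

BEGIN;
-- Nothing executes at DECLARE time
DECLARE probe_cur CURSOR FOR
SELECT columns... FROM {table} -- and some other logic
;
-- The query runs on the first FETCH; time this statement
FETCH 1 FROM probe_cur;
CLOSE probe_cur;
COMMIT;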
I expect you will see that this query is taking most of the time and normal query optimization processes can be followed. (Look at amount of data read, data growth, redistribution, etc.)
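For example, once you have the query id (from the console or STL_QUERY), a per-step profile along these lines shows where the rows, bytes, and disk-based steps are (123456 is a placeholder id):

SELECT query, seg, step, label, rows, bytes, is_diskbased
FROM svl_query_summary
WHERE query = 123456
ORDER BY seg, step;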
Now, if it does turn out that a meaningful portion of the time is spent in the insert, there are a few things to look at. You mention ordering your query results per the target table's sort keys. This is in general a good idea; however, when the target table is empty, Redshift does this by default. Also, vacuum time would not be counted against the runtime of this insert.
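You can check whether the data actually landed in sort order (i.e. no vacuum pending) with SVV_TABLE_INFO; unsorted is the percent of rows out of sortkey order, so near zero means there is nothing to vacuum (the table name is a placeholder):

SELECT "schema", "table", tbl_rows, unsorted
FROM svv_table_info
WHERE "table" = 'your_end_table';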
The first thing I see that is suspect is the EVEN distribution of both tables. When a Redshift table is in EVEN distribution you should read that as "random distribution". So moving data from one "randomly distributed" table to another "randomly distributed" table will mean that a lot of the table data will need to be pushed around the cluster network. This network movement wastes time and consumes a limited resource (network bandwidth). There are a number of good reasons for tables to be in EVEN distribution but all of them have to do with "there's no better option". EVEN distribution, while the default, should be the distribution of last resort.
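If the two tables share a high-cardinality column that you also join or group on, distributing both on that column keeps rows node-local during the INSERT ... SELECT. A sketch, where the column choice (some_id_column) is an assumption you'd validate against your own joins:

-- Switch both tables from EVEN to KEY distribution on the same column
ALTER TABLE {schema}.{table} ALTER DISTSTYLE KEY DISTKEY some_id_column;
ALTER TABLE {new_schema}.{new_table} ALTER DISTSTYLE KEY DISTKEY some_id_column;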
Ordering the results for insert is another place where time can be spent. As I mentioned this is done if the target table is empty and it has sort keys. Your idea of removing sort keys from intermediate result tables has merit if sorting is costing you significant time.
Other information that will help to understand what is going on:

- Your cluster size / node type (table sizes are relative to cluster size)
- The query text (sanitized) and the explain plan of the query
- Table widths, if they are large
- Sort style (just tell me you're not using interleaved sort keys)
Upvotes: 2