Reputation: 51
I have a use case where multiple client tables are stored in a single database in AWS Redshift, each client with its own separate schema, so that queries can run concurrently as they use their own separate tables. I am running them on the default queue, which has the priority set to HIGHEST.
Recently I had a case where one query took all the resources of the cluster and stalled all other queries for a very long time. This query performed join operations and inserted data into some tables. One approach I have implemented to tackle this issue is to create 3 additional queues, redirect some of the queries to them from the default queue, and switch the query from a normal view to a materialised view.
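For illustration, the materialised view change was along these lines (the schema, table and view names here are placeholders, not my real objects):
-- Replace the plain view with a materialised view so the join result is precomputed
-- (client_a, orders, customers and sales_mv are placeholder names for illustration)
create materialized view client_a.sales_mv
auto refresh no
as
select o.customer_id, o.order_date, sum(o.amount) as total_amount
from client_a.orders o
join client_a.customers c on c.customer_id = o.customer_id
group by o.customer_id, o.order_date;
-- Refreshed explicitly after the loads instead of recomputing the join on every read
refresh materialized view client_a.sales_mv;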
From the documentation I understand that you can define a maximum of 8 queues and each queue has 5 query slots by default (40 slots across 8 queues). Since most of my queries are write heavy (INSERT, UPDATE and joins) and span multiple schemas (more than 30), I am thinking of utilising all 8 queues with auto WLM.
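For reference, here is roughly how I have been checking which queues, slots and priorities are actually in effect (assuming the stv_wlm_service_class_config system table reflects the live WLM setup; the service_class > 4 filter is just to skip the system-reserved classes):
-- Inspect the active WLM queue configuration
select service_class, trim(name) as queue_name, num_query_tasks as slots,
       query_working_mem, query_priority
from stv_wlm_service_class_config
where service_class > 4
order by service_class;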
In this scenario I am confused: when there is only 1 query running on a queue that has priority set to HIGHEST, will it be able to utilise all the available resources of the cluster, or will it just take the memory available to the one slot in the queue it's running in? Please feel free to suggest any other approach you feel would be helpful for implementing auto WLM, or point me in the right direction (articles, case studies, etc.).
Also feel free to point out anything I have misinterpreted from the documentation.
Upvotes: 0
Views: 1778
Reputation: 11032
Max Ganz II did a fair job explaining auto WLM and its limitations so I won't rehash that. I will speak to the query degrading performance. I spent almost 30 years designing microprocessors and high-end computer systems and 5-plus years helping people use Redshift. I've seen this before, and there is likely nothing you can do to improve this using WLM settings, but there is hope.
The issue is LIKELY due to the query overloading one or several system resources (which is what you suspect). However, the database can only work around the edges to limit what resources the query processes use. WLM can control how big the malloc'ed chunks are so that the query runs in a more limited memory footprint, but it cannot control how much memory the query uses. WLM can try to limit the amount of time a query runs on the CPU, but it doesn't really control the process scheduler; the OS does. These WLM settings can make an expensive query "friendlier" to other queries, but these are only adjustments around the edges.
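If you want to see this memory behaviour for a specific query, svl_query_summary reports the working memory assigned to each step (workmem) and whether the step spilled to disk (is_diskbased). A sketch along these lines (the query id 123456 is just a placeholder) shows whether the slot's memory allocation was the limiting factor:
-- Per-step memory grant and disk spill for one query
-- (replace 123456 with the query id you are investigating)
select query, seg, step, label,
       workmem/1000000 as workmem_mb,
       is_diskbased, rows, bytes
from svl_query_summary
where query = 123456
order by seg, step;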
The OS (Linux) controls what runs on the hardware, and the hardware has limits. When one of these limits is reached, all work on the system is impacted. This is true of all database systems, not just Redshift, but Redshift is built on a cluster of high-end industry-standard hardware, so some of these limits are hit earlier than on a purpose-built scale-up system (and some later). The culprits to watch out for are memory, CPU, disk IO, and network IO (along with overloading the leader node, which adds a few more pinch points).
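A quick way to see which of these resources a given query is leaning on is svl_query_metrics_summary, which rolls up CPU time, blocks read, and spill to disk per query. A sketch (the ordering and limit are arbitrary; tune them to your cluster):
-- Per-query resource rollup: CPU, disk reads, and spill to disk
select query, service_class, query_cpu_time, query_blocks_read,
       query_temp_blocks_to_disk, query_execution_time
from svl_query_metrics_summary
where userid <> 1
order by query_temp_blocks_to_disk desc nulls last
limit 20;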
There is a high likelihood that the query in question is overloading one of these hardware systems and needs a rewrite. The 2 most common causes are queries with high spill that are swapping TBs of data to disk and queries that are distributing TBs of data across the network. Both of these can be happening in the same query. Both the disk IO system and the network IO system have limited bandwidth, and their use is controlled by the OS - there is nothing the database layer can do to limit one query's use of these.
I strongly suspect that you will need to look into this query and see what network and IO resources it is using. (It could be due to other aspects but these are the top 2.) Here are 3 queries I've used to start digging into similar issues:
-- High network queries
select userid, starttime, query, sum(bytes)/1000000000 as network_gbytes
from stl_dist
where userid <> 1 and starttime > getdate() - interval '1 day'
group by 1, 2, 3
having network_gbytes > 10;
-- High spill queries
select userid, starttime, query, sum(bytes)/1000000000 as spill_gbytes
from stl_scan s
where perm_table_name ilike '%internal worktable%'
and userid <> 1 and starttime > getdate() - interval '1 day'
group by 1, 2, 3
having spill_gbytes > 10 ;
-- High scan queries
select userid, starttime, query, sum(bytes)/1000000000 as table_scan_gbytes
from stl_scan s
where perm_table_name not ilike '%internal worktable%'
and userid <> 1 and starttime > getdate() - interval '1 day'
group by 1, 2, 3
having table_scan_gbytes > 100;
These are just general search queries for the past day so you will need to adjust them to focus on your query and size the "having" clause to your cluster.
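For example, once one of these flags a suspect query id, something like this ties it back to its SQL text via stl_query (123456 is a placeholder for the query id):
-- Tie a heavy distributor back to its SQL text
select q.query, q.starttime, trim(q.querytxt) as sql_text,
       sum(d.bytes)/1000000000 as network_gbytes
from stl_dist d
join stl_query q on q.query = d.query
where d.query = 123456
group by 1, 2, 3;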
Upvotes: 2