Reputation: 91
I'm currently working on optimizing a query for a dataset with 14 columns and 6.6 million rows, and I’d love to get your thoughts on which approach is better in terms of efficiency, clarity and maintainability. I’ve got two versions of the query, and while both work, I’m curious to hear your opinions on which one you’d prefer and why.
The goal of the queries is to retrieve the top 3 most requested services based on the request count for each season.
In terms of performance the sub-query took 1m33s to execute, the CTE one took 1m24s.
QUERY 1: Using Sub-Query
SELECT season, service_name, request_count
FROM (
SELECT season, service_name, COUNT(*) AS request_count,
ROW_NUMBER() OVER (PARTITION BY season ORDER BY COUNT(*) DESC) AS row_num
FROM service_requests_clean_v3
GROUP BY season, service_name
) AS ranked_requests
WHERE row_num IN (1,2,3)
ORDER BY season;
QUERY 2: Using a CTE
WITH ranked_requests AS (
SELECT season, service_name, COUNT(*) AS request_count
FROM service_requests_clean_v3
GROUP BY season, service_name
),
top_requests AS (
SELECT season, service_name, request_count,
ROW_NUMBER() OVER (PARTITION BY season ORDER BY request_count DESC) AS row_num
FROM ranked_requests
)
SELECT season, service_name, request_count
FROM top_requests
WHERE row_num IN (1,2,3)
ORDER BY season;
Below a sample of the data:
service_request_id | requested_date | updated_date | closed_date | status_description | source | service_name | agency_responsible | comm_code | comm_name | longitude | latitude | point | season |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21-00609175 | 2021-08-17 0:00 | 2021-08-19 0:00 | 2021-08-19 0:00 | Closed | Phone | CBS Inspection - Electrical | PD - Calgary Building Services | HAR | HARVEST HILLS | -114.052778 | 51.14767759 | POINT (-114.052777977324 51.14767758603) | Summer |
21-00609171 | 2021-08-17 0:00 | 2021-08-19 0:00 | 2021-08-19 0:00 | Closed | Phone | Bylaw - Noise Concerns | CS - Calgary Community Standards | BLN | BELTLINE | -114.0551888 | 51.03743793 | POINT (-114.055188797256 51.037437926217) | Summer |
18-01094780 | 2018-11-12 0:00 | 2020-06-24 0:00 | 2020-06-24 0:00 | Closed | App | Roads - Traffic Signal Timing Inquiry | TRAN - Roads | DAL | DALHOUSIE | -114.1581646 | 51.10941383 | POINT (-114.158164615208 51.109413833961) | Fall |
18-01017964 | 2018-10-17 0:00 | 2020-06-24 0:00 | 2020-06-24 0:00 | Closed | Phone | Roads - Traffic Signal Timing Inquiry | TRAN - Roads | SHN | SHAWNESSY | -114.0737568 | 50.90330333 | POINT (-114.073756821499 50.903303329478) | Fall |
18-00948852 | 2018-09-25 0:00 | 2021-02-04 0:00 | 2021-02-04 0:00 | Closed | Phone | Roads - Debris on Street/Sidewalk/Boulevard | TRAN - Roads | ALB | ALBERT PARK/RADISSON HEIGHTS | -113.9967838 | 51.04454101 | POINT (-113.996783821473 51.044541013835) | Fall |
UPDATE
Found a way to optimize the performance of the queries by adding an index:
CREATE INDEX idx_season_service_name ON service_requests_clean_v3 (season, service_name);
Running the same queries after creating the index had a big impact on the one using the CTE, time dropped to 14s, almost no change for the sub-query one.
To double check the result, I updated the table by creating a month_requested
column and indexing it as well
ALTER TABLE service_requests_clean_v3
ADD COLUMN month_requested VARCHAR(20);
UPDATE service_requests_clean_v3
SET month_requested = MONTHNAME(requested_date);
CREATE INDEX idx_month_requested_service_name ON service_requests_clean_v3 (month_requested, service_name);
Using th same type of queries but this time Grouping and Ordering By month, the index had the exact same results.
CONCLUSION FOR THIS SCENARIO : With an index, the CTE query is clearer and way more faster.
Looking forward to your insights!
Upvotes: -1
Views: 84