G_Yahia
G_Yahia

Reputation: 91

Sub-Query or CTE

I'm currently working on optimizing a query for a dataset with 14 columns and 6.6 million rows, and I’d love to get your thoughts on which approach is better in terms of efficiency, clarity and maintainability. I’ve got two versions of the query, and while both work, I’m curious to hear your opinions on which one you’d prefer and why.

The goal of the queries is to retrieve the top 3 most requested services based on the request count for each season.

In terms of performance the sub-query took 1m33s to execute, the CTE one took 1m24s.

QUERY 1: Using Sub-Query

SELECT season, service_name, request_count
FROM (
    SELECT season, service_name, COUNT(*) AS request_count,
           ROW_NUMBER() OVER (PARTITION BY season ORDER BY COUNT(*) DESC) AS row_num
    FROM service_requests_clean_v3
    GROUP BY season, service_name
) AS ranked_requests
WHERE row_num IN (1,2,3)
ORDER BY season;

QUERY 2: Using a CTE

WITH ranked_requests AS (
    SELECT season, service_name, COUNT(*) AS request_count
    FROM service_requests_clean_v3
    GROUP BY season, service_name
),
top_requests AS (
    SELECT season, service_name, request_count,
           ROW_NUMBER() OVER (PARTITION BY season ORDER BY request_count DESC) AS row_num
    FROM ranked_requests
)
SELECT season, service_name, request_count
FROM top_requests
WHERE row_num IN (1,2,3)
ORDER BY season;

Below a sample of the data:

service_request_id requested_date updated_date closed_date status_description source service_name agency_responsible comm_code comm_name longitude latitude point season
21-00609175 2021-08-17 0:00 2021-08-19 0:00 2021-08-19 0:00 Closed Phone CBS Inspection - Electrical PD - Calgary Building Services HAR HARVEST HILLS -114.052778 51.14767759 POINT (-114.052777977324 51.14767758603) Summer
21-00609171 2021-08-17 0:00 2021-08-19 0:00 2021-08-19 0:00 Closed Phone Bylaw - Noise Concerns CS - Calgary Community Standards BLN BELTLINE -114.0551888 51.03743793 POINT (-114.055188797256 51.037437926217) Summer
18-01094780 2018-11-12 0:00 2020-06-24 0:00 2020-06-24 0:00 Closed App Roads - Traffic Signal Timing Inquiry TRAN - Roads DAL DALHOUSIE -114.1581646 51.10941383 POINT (-114.158164615208 51.109413833961) Fall
18-01017964 2018-10-17 0:00 2020-06-24 0:00 2020-06-24 0:00 Closed Phone Roads - Traffic Signal Timing Inquiry TRAN - Roads SHN SHAWNESSY -114.0737568 50.90330333 POINT (-114.073756821499 50.903303329478) Fall
18-00948852 2018-09-25 0:00 2021-02-04 0:00 2021-02-04 0:00 Closed Phone Roads - Debris on Street/Sidewalk/Boulevard TRAN - Roads ALB ALBERT PARK/RADISSON HEIGHTS -113.9967838 51.04454101 POINT (-113.996783821473 51.044541013835) Fall

UPDATE

Found a way to optimize the performance of the queries by adding an index:

CREATE INDEX idx_season_service_name ON service_requests_clean_v3 (season, service_name);

Running the same queries after creating the index had a big impact on the one using the CTE, time dropped to 14s, almost no change for the sub-query one.

To double check the result, I updated the table by creating a month_requested column and indexing it as well

ALTER TABLE service_requests_clean_v3
ADD COLUMN month_requested VARCHAR(20);

UPDATE service_requests_clean_v3
SET month_requested = MONTHNAME(requested_date);

CREATE INDEX idx_month_requested_service_name ON service_requests_clean_v3 (month_requested, service_name);

Using th same type of queries but this time Grouping and Ordering By month, the index had the exact same results.

CONCLUSION FOR THIS SCENARIO : With an index, the CTE query is clearer and way more faster.

Looking forward to your insights!

Upvotes: -1

Views: 84

Answers (0)

Related Questions