Reputation: 23492
I have to perform relatively complicated query involving multiple deep joins and complex predicates, where the result (and criteria) depends of suitable entries available meeting the criteria. There are kinda primary and secondary criteria involved, primary is applied always, and secondary strikes in if the the result is not satisfying. In a nutshell
But if there's not enough different document types or different customers with a document, try to still satisfy the amount of N documents
I choose declarative (query) approach over imperative (cursors and counters). And this is where the WITH
clause comes in. Roughly, by using multiple WITH
blocks (CTE's) I declare queries (I like to think them as ad-hoc views) to declare two distinct target set for both document types. In the end, I UNION
subsets of different CTE's for the final result, performing some COUNT
checks to limit the amount.
Multiple CTE's are referencing each other and are referenced from multiple places in the context of COUNT
and NOT EXISTS
for example. I'm novice with SQL and happened to encounter WITH
by a coincidence, and chose to use it intuitively. Is this the correct use case for WITH
, or an anti pattern? How does this solution compares in performance to implementing the same functionality with cursors and counters in an imperative style? Have I chosen the wrong approach? We are talking about tables with million entries.
Here is the the query as a whole. Pardon me, I had to obscure the fields due confidentiality.
WITH target_documents AS (
SELECT
<Necessary fields>
FROM documents l
WHERE
<Suitable document criteria>
),
target_documents_type_1 AS (
SELECT * FROM target_documents WHERE type = 1
),
target_documents_type_2 AS (
SELECT * FROM target_documents WHERE type = 2
),
target_customers AS (
SELECT
<Necessary fields>
FROM customers a
WHERE
<Suitable customer criteria>
AND
EXISTS(
SELECT 1 FROM target_documents l WHERE l.customer_id = a.customer_id
)
),
target_customers_type_1 AS (
SELECT * FROM target_customers a WHERE EXISTS(
SELECT 1 FROM target_documents_type_1 l WHERE l.customer_id = a.customer_id
)
AND ROWNUM <= (<N> / 2)
),
target_customers_type_2 AS (
SELECT * FROM target_customers a WHERE EXISTS(
SELECT 1 FROM target_documents_type_2 l WHERE l.customer_id = a.customer_id
)
AND a.customer_id NOT IN (
SELECT customer_id FROM target_customers_type_1
)
AND ROWNUM <= <N>
),
-- This is the set, which meets the the primary criteria:
-- Contains only distinct customers
-- The amount of different document types is balanced as much as possible
different_customers_set AS (
SELECT
<Necessary fields>
FROM target_customers_type_1 a -- rows 0--(<N>/2) amount
JOIN target_documents_type_1 l ON (l.customer_id = a.customer_id)
WHERE
l.create_dt = (SELECT MAX(create_dt) FROM target_documents_type_1 WHERE customer_id = l.customer_id)
UNION ALL
SELECT
<Necessary fields>
FROM target_customers_type_2 a -- rows 0--<N> amount
JOIN target_documents_type_2 l ON (l.customer_id = a.customer_id)
WHERE
l.create_dt = (SELECT MAX(create_dt) FROM target_documents_type_2 WHERE customer_id = l.customer_id) AND
ROWNUM <= <N> - (SELECT COUNT(*) FROM target_customers_type_1) -- Limit the total to max N rows
)
-- Final result: primary criteria result filled with the result of secondary criteria
SELECT * FROM different_customers_set
UNION ALL
SELECT
<Necessary fields>
FROM target_customers a
JOIN target_documents l ON (l.customer_id = a.customer_id AND l.document_id NOT IN (SELECT document_id FROM different_customers_set))
WHERE
ROWNUM <= <N> - (SELECT COUNT(1) FROM different_customers_set);
Is this the correct usage of WITH
clause? Are there some obvious performance issues, where I should refactor? Or should I have done all this imperatively? Also, this query itself defines a cursor, which is used opened repeatedly in a loop (the loop defines certain criteria for customers).
I'm especially concerned, how the optimizer handles those WITH
blocks. Is the most efficiently plan always used (so there's not performance penalty compared to using cursors)?
Upvotes: 5
Views: 13683
Reputation: 2099
Using multiple CTEs isn't inherently bad, and I do this from time to time to make for cleaner code. Here's my general approach for determining if a query meets my performance standards:
Write out a version of the query that you think will grab the data you want. It seems obvious, but make sure you're actually gathering all the data you meant to, with none of the data you don't want.
Run an Explain plan (F10 if using Oracle SQL Developer) to evaluate the cost before you run. I do this to make sure I haven't missed a critical join or index that would cause my query to run for an entire weekend and not return (speaking from experience; in my environment, cancelling a query in SQL Dev is just a suggestion and the database usually ignores the request).
If everything looks OK in the Explain plan, run the query with an Autotrace to evaluate its actual performance. The Explain is just a guess, and the Autotrace is far more accurate.
Compare execution plans with multiple versions of your query. Are you doing everything you can to extract that last bit of performance? Do you see any full table scans where it should be using an index? Do you have any nested loops that can be eliminated?
I rarely if ever across a situation where I think to myself "wow, this query would be way better off without CTEs", though I think the opposite nearly every day.
Upvotes: 4
Reputation: 146229
The only true way to evaluate the performance of a query against some other approach is to benchmark both and see which is the best fit for your data and your application/environment. Everything else is just grimacing and guessing.
Having said which :-)
SQL is generally faster than PL/SQL, especially for large amounts of data because set operations scale better than row-by-row processing.
The temptation with the WITH clause is to build a procedural query rather than a set-based one. It looks like you've dodged that bullet: you're using subquery factoring properly, to reduce the number of times you execute the query to get the same result set.
However, that's just my guess. You could try writing the query the long way, without the WITH clause and see which is better (faster)
Upvotes: 1
Reputation: 3586
target_customers_type_2
CTE reference itself (probably should be referencing target_documents_type_2
).ROWNUM
and generating "random" results should be avoided like a plague. Very difficult to test and debug.Upvotes: 1