Reputation: 1109
I have a query in SQL Server 2008 R2 in the following form:
SELECT TOP (2147483647) *
FROM (
SELECT *
FROM sub_query_a
) hierarchy
LEFT JOIN (
SELECT *
FROM sub_query_b
) expenditure
ON hierarchy.x = expenditure.x AND hierarchy.y = expenditure.y
ORDER BY hierarchy.c, hierarchy.d, hierarchy.e
The hierarchy
subquery contains UNIONS and INNER JOINS. The expenditure
subquery is based on several levels of sub-subqueries, and contains UNIONS, INNER and LEFT JOINS, and ultimately, a PIVOT aggregate.
The hierarchy
subquery by itself runs in 2 seconds and returns 467 rows. The expenditure
subquery by itself runs in 7 seconds and returns 458 rows. Together, without the ORDER BY
clause, the query runs in 11 seconds. However, with the ORDER BY
clause, the query runs in 11 minutes.
The Actual Execution Plan reveals what's different. Without the ORDER BY
clause, both the hierarchy
and expenditure
subqueries are running once each, with the results being Merge Join (Right Outer Join)
joined together. When the ORDER BY
clause is included, the hierarchy
query is still run once, but the expenditure
portion is run once per row from the hierarchy query, and the results are Nested Loops (Left Outer Join)
joined together. Its as if the ORDER BY
clause is causing the expenditure
subquery to become a correlated subquery (which it is not).
To verify that SQL Server was actually capable of doing the query and producing a sorted result set in 11 seconds, as a test, I created a temp table and inserted the results of the query without the ORDER BY
clause into it. Then I did a SELECT * FROM #temp_table ORDER BY c, d, e
. The entire script took the expected 11 seconds, and returned the desired results.
I want to make the query work efficiently with the ORDER BY
clause as one query--I don't want to have to create a stored procedure just to enable the #temp_table hacky solution.
Any ideas on the cause of this issue, or a fix?
Upvotes: 0
Views: 1209
Reputation: 1109
Thanks to @MartinSmith's comment, I got looking at what could cause the major discrepancies between the estimated and actual rows delivered by the expenditure
subquery in the non-ORDER BY
version, even though I eventually wanted to ORDER
it. I thought that perhaps if I can optimize that version a bit, perhaps that would also benefit the ORDER BY
version as well.
As I mentioned in the OP, the expenditure
subquery contains a PIVOT
aggregation across yet another subquery (let's call it unaggregated_expenditure
). I added a layer between the PIVOT
and the unaggregated_expenditure
subquery which aggregated the required column before PIVOT
ing the same column across the required few pivot columns. This added a bit of conceptual complexity, yet was able to reduce the estimated number of rows coming from the PIVOT
from 106,245,000 to 10,307. This change, when applied to the ORDER BY
version of the whole query, resulted in a different Actual Execution Plan that was able to process and deliver the query within the desired 11 seconds.
Upvotes: 1
Reputation: 1269633
To avoid nested loop joins, you can give an option
to the compiler:
SELECT TOP (2147483647) *
FROM (
SELECT *
FROM sub_query_a
) hierarchy
LEFT JOIN (
SELECT *
FROM sub_query_b
) expenditure
ON hierarchy.x = expenditure.x AND hierarchy.y = expenditure.y
ORDER BY hierarchy.c, hierarchy.d, hierarchy.e
option (merge join, hash join)
I generally much prefer to have the optimizer figure out the right query plan. On rare occasions, however, I run into a problem similar to yours and need to make a suggestion to push it in the right direction
Upvotes: 1