Having trouble understanding how running aggregate works

Question

First let me introduce you to an example and after that I'll ask the question. Code

SELECT
    orderyear
   ,qty
FROM
    Sales.OrderTotalsByYear;

gives me a table that looks like this

orderyear   qty
----------- -----------
2007        25489
2008        16247
2006        9581

I need to return for each year the order year, quantity, and running total quantity over the years. That is, for each year, return the sum of the quantity up to that year. So for the earliest year recorded in the view (2006), the running total is equal to that year’s quantity. For the second year (2007), the running total is the sum of the first year plus the second year, and so on. The code looks like this

SELECT
    orderyear
   ,qty
   ,(
        SELECT
            SUM(O2.qty)
        FROM
            Sales.OrderTotalsByYear AS O2
        WHERE
            O2.orderyear <= O1.orderyear
    )
    AS runqty
FROM
    Sales.OrderTotalsByYear AS O1
ORDER BY
    orderyear;

and a table

orderyear   qty         runqty
----------- ----------- -----------
2006        9581        9581
2007        25489       35070
2008        16247       51317

Now, I understand what this code does but I don't understand HOW it does it. I have experience in procedural and object oriented programming but this just drives me crazy. If the query goes in a way like this

FROM

WHERE

GROUP BY

HAVING

SELECT

ORDER BY

then how it manages to combine inner SELECT inside outer SELECT by using relationships between inner and outer SELECT? Is the outer SELECT runned first and it stops when it hits first element inside a table, and then the inner SELECT starts to run for elements where O2.orderyear <= O1.orderyear is true? Or is there something totally different going on?

Damien_The_Unbeliever · Accepted Answer

Logically, for each row that the system has generated (via FROM, WHERE, GROUP BY and HAVING), the system will evaluate the SELECT clause. As part of that evaluation of each row, the system will evaluate the correlated subquery:

(
    SELECT
        SUM(O2.qty)
    FROM
        Sales.OrderTotalsByYear AS O2
    WHERE
        O2.orderyear <= O1.orderyear
)

Using the current row's O1.orderyear value.

However, from a practical standpoint¹, the system may be able to optimize its evaluation of this subquery. A smart enough optimizer, if the statistics suggest it is worth doing, may decide to evaluate the outer query in orderyear order, and to also create a copy of the OrderTotalsByYear table sorted in orderyear order (or to make use of an index that already represents this sort order). In such a case, the system would be able to evaluate this subquery result without having to re-scan the entire OrderTotalsByYear table for each row of the outer query.

What the optimizer does can only be ascertained by obtaining the execution plan, and will depend on your specific tables - their structure, indexes and the data contained within them.

¹SQL is defined in terms of a logical processing order. Implementations are free to perform operations in a different order than the logical processing order, provided they produce the same results which would have been obtained if the logical processing order is followed². SQL is also defined, generally, to work on sets rather than to specify row-by-row or left-to-right processing.

²SQL Server takes more liberties here than it ought to and may generate errors which would not have been generated had it followed the logical processing order. Ho hum.

Having trouble understanding how running aggregate works

Answers (2)

Related Questions