Daniel
Daniel

Reputation: 1120

Having trouble understanding how running aggregate works

First let me introduce you to an example and after that I'll ask the question. Code

SELECT
    orderyear
   ,qty
FROM
    Sales.OrderTotalsByYear;

gives me a table that looks like this

orderyear   qty
----------- -----------
2007        25489
2008        16247
2006        9581

I need to return for each year the order year, quantity, and running total quantity over the years. That is, for each year, return the sum of the quantity up to that year. So for the earliest year recorded in the view (2006), the running total is equal to that year’s quantity. For the second year (2007), the running total is the sum of the first year plus the second year, and so on. The code looks like this

SELECT
    orderyear
   ,qty
   ,(
        SELECT
            SUM(O2.qty)
        FROM
            Sales.OrderTotalsByYear AS O2
        WHERE
            O2.orderyear <= O1.orderyear
    )
    AS runqty
FROM
    Sales.OrderTotalsByYear AS O1
ORDER BY
    orderyear;

and a table

orderyear   qty         runqty
----------- ----------- -----------
2006        9581        9581
2007        25489       35070
2008        16247       51317

Now, I understand what this code does but I don't understand HOW it does it. I have experience in procedural and object oriented programming but this just drives me crazy. If the query goes in a way like this

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY

then how it manages to combine inner SELECT inside outer SELECT by using relationships between inner and outer SELECT? Is the outer SELECT runned first and it stops when it hits first element inside a table, and then the inner SELECT starts to run for elements where O2.orderyear <= O1.orderyear is true? Or is there something totally different going on?

Upvotes: 3

Views: 73

Answers (2)

Thorsten Kettner
Thorsten Kettner

Reputation: 95053

You can think of the query in two steps: The subquery in your select clause is correlated, i.e. it refers to the main query via O2.orderyear <= O1.orderyear. So the main query gets executed, and for every single record it reads from the table (aliased O1), it executes the subquery in order to get the running quantity.

So when the main query reads the record for 2008, it executes

SELECT SUM(O2.qty)
FROM Sales.OrderTotalsByYear AS O2
WHERE O2.orderyear <= 2008

There are two records with a year <= 2008: 2007 and 2008. Their quantities are being added, and thus you get the running total to display with the 2008 record. The same is done for the 2007 and the 2009 record. At last your rows are ordered so as to show the years in ascending order.

Upvotes: 0

Damien_The_Unbeliever
Damien_The_Unbeliever

Reputation: 239764

Logically, for each row that the system has generated (via FROM, WHERE, GROUP BY and HAVING), the system will evaluate the SELECT clause. As part of that evaluation of each row, the system will evaluate the correlated subquery:

(
    SELECT
        SUM(O2.qty)
    FROM
        Sales.OrderTotalsByYear AS O2
    WHERE
        O2.orderyear <= O1.orderyear
)

Using the current row's O1.orderyear value.

However, from a practical standpoint1, the system may be able to optimize its evaluation of this subquery. A smart enough optimizer, if the statistics suggest it is worth doing, may decide to evaluate the outer query in orderyear order, and to also create a copy of the OrderTotalsByYear table sorted in orderyear order (or to make use of an index that already represents this sort order). In such a case, the system would be able to evaluate this subquery result without having to re-scan the entire OrderTotalsByYear table for each row of the outer query.

What the optimizer does can only be ascertained by obtaining the execution plan, and will depend on your specific tables - their structure, indexes and the data contained within them.


1SQL is defined in terms of a logical processing order. Implementations are free to perform operations in a different order than the logical processing order, provided they produce the same results which would have been obtained if the logical processing order is followed2. SQL is also defined, generally, to work on sets rather than to specify row-by-row or left-to-right processing.

2SQL Server takes more liberties here than it ought to and may generate errors which would not have been generated had it followed the logical processing order. Ho hum.

Upvotes: 1

Related Questions