Bruce
Bruce

Reputation: 341

What SQL query operations can change their value dependent on order?

[see addendum below]

Recently I was going through an SQL script as part of a task to check functionality of a data science process. I had a copy of a section of the script which had multiple sub queries and I refactored it to put the sub queries up the top in a with-clause. I usually think of this as an essentially syntactic refactoring operation that is semantically neutral. However, the operation of the script changed.

Investigation showed that it was due to the use of a row number over a partition in which the ordering within the partition was not complete. Changing the structure of the code changed something in the execution plan, that changed the order within the slack left by the incomplete ordering.

I made a note of this point and became less confident of this refactoring, although I hold the position that order should not affect the semantics, at least as long as it can be avoided.

My question is ...

other than assigning a row number, what operations have a value that is changed by the ordering?


I realize now that the question was a bit too open - both answers below were useful to me, but I cannot pick one over the other as THE right answer. I have up-voted both. Thanks. [I rethought that, and will pick one of the answers, rather than none. The one I pick was a bit more on target].

I also realize that the core of the problem was my not having strongly enough in mind that any refactoring can potentially change the contingent order in which the rows are returned. From now on, if I refactor and it changes the result - I will look for issues with ordering.

Upvotes: 3

Views: 204

Answers (4)

Pepik
Pepik

Reputation: 135

Another example would be LISTAGG. I inherited some code that used LISTAGG, but didn't give consistent answers when I tweaked it, because it didn't include the ordering clause: WITHIN GROUP ( ORDER BY ...).

From the Snowflake docs:

If you do not specify the WITHIN GROUP (<orderby_clause>), the order of elements within each list is unpredictable. (An ORDER BY clause outside the WITHIN GROUP clause applies to the order of the output rows, not to the order of the list elements within a row.)

Upvotes: 1

Bruce
Bruce

Reputation: 341

SQL is not based on set theory but on list theory. It is true that many join-based operations have an output such that the underlying bag of elements is a function of the underlying bag of elements in the input - but there are operations, such as row_number() as mentioned, in which this is not the case.

I would like to add a more obscure effect not mentioned in the other answers so far. Floating point arithmetic. Since the order of adding up floating point numbers does actually make a difference, it is possible that using a different ordering clause can produce different floating point values.

In the case mentioned in the posted question, this did actually happen - although only in the 10th decimal place. But that can be enough to change which value is bigger than another, and so make a discrete and significant change to the result of the outermost query.

Upvotes: 2

Lukasz Szozda
Lukasz Szozda

Reputation: 175606

When windowed functions are involved, especially the ROW_NUMBER() the first thing to check is if the columns used for ordering produce a stable sort.

For instance:

CREATE TABLE t(id INT, grp VARCHAR(100), d DATE, val VARCHAR(100));

INSERT INTO t(id, grp, d, val)
VALUES (1, 'grpA', '2021-10-16', 'b')
      ,(2, 'grpA', '2021-10-16', 'a')
      ,(3, 'grpA', '2021-10-15', 'c')
      ,(4, 'grpA', '2021-10-14', 'd')
      ,(5, 'grpB', '2021-10-13', 'a')
      ,(6, 'grpB', '2021-10-13', 'g')
      ,(7, 'grpB', '2021-10-12', 'h');

-- the sort is not stable, d column has a tie
SELECT * 
FROM (
   SELECT t.*, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY d DESC) AS rn
   FROM t) sub
WHERE sub.rn = 1 AND sub.val = 'a';

enter image description here

Depending of the order of operation it could return:

  • 0 rows
  • 1 row (id: 2)
  • 1 row (id: 5)
  • 2 rows(id: 2 and 5)

When query is refactored it could cause choosing a differnt path to access the data thus different result.


To check if sort is stable windowed COUNT could be used using all available columns:

SELECT * 
FROM (
   SELECT t.*, COUNT(*) OVER(PARTITION BY grp, d ) AS cnt
   FROM t) sub
WHERE cnt > 1;

db<>fiddle demo

Upvotes: 5

Simeon Pilgrim
Simeon Pilgrim

Reputation: 25903

So are you saying to had gaps in the row_numbers()? or duplicate row_numbers? or just row numbers jumped around (unstable?)

Which functions are altered by incomplete/unstable order by functions, all the ones where you put OBER BY in the window function. Thus ROW_NUMBER or LAG or LEAD

But in general a sub-select and a CTE (with clause) are the same, the primary difference is multiple things can JOIN the same CTE (thus the Common part) this can be good/bad as you might save on some expensive calculation, but you might also slow down a critical path, and make the whole execution time slower.

Or the data might be a little more processed (due to JOIN's etc) and then the incomplete ODERBY/instability might be exposed.

Upvotes: 3

Related Questions