Karthik
Karthik

Reputation: 99

What is the role of ORDER BY in the PARTITION BY function?

I have a table with data follow,

    ID        SEQ    EFFDAT                 
------- ---------    -----------------------
  1024          1    01/07/2010 12:00:00 AM   
  1024          3    18/04/2017 12:00:00 AM   
  1024          2    01/08/2017 12:00:00 AM   

When I execute the following query, I am getting wrong maximum sequence still I am getting the correct maximum effective date.

Query:

SELECT 
max(seq) over (partition by id order by EFFDAT desc) maxEffSeq,
partitionByTest.*,
max(EFFDAT) over (partition by (id) order by EFFDAT desc ) maxeffdat
FROM partitionByTest;

Output:

 MAXEFFSEQ         ID        SEQ EFFDAT                   MAXEFFDAT              
---------- ---------- ---------- ------------------------ ------------------------
         2       1024          2 01/08/2017 12:00:00 AM   01/08/2017 12:00:00 AM   
         3       1024          3 18/04/2017 12:00:00 AM   01/08/2017 12:00:00 AM   
         3       1024          1 01/07/2010 12:00:00 AM   01/08/2017 12:00:00 AM 

If I remove the order by in my query, I am getting the correct output.

Query:

SELECT max(seq) over (partition by id ) maxEffSeq, partitionByTest.*,
max(EFFDAT) over (partition by (id) order by EFFDAT desc ) maxeffdat
FROM partitionByTest;

Output:

 MAXEFFSEQ         ID        SEQ EFFDAT                   MAXEFFDAT              
---------- ---------- ---------- ------------------------ ------------------------
         3       1024          2 01/08/2017 12:00:00 AM   01/08/2017 12:00:00 AM   
         3       1024          3 18/04/2017 12:00:00 AM   01/08/2017 12:00:00 AM   
         3       1024          1 01/07/2010 12:00:00 AM   01/08/2017 12:00:00 AM   

I know that when we are using MAX function, it is not required to use order by clause. But I am interested to know how order by works in partition by function and why it is giving the wrong result for sequence and correct result for date when I use order by clause ?

Upvotes: 4

Views: 1442

Answers (2)

Alex Poole
Alex Poole

Reputation: 191455

Adding an order by also implies a windowing clause, and as you have't specified one you get the default, so you're really doing:

max(seq) over (
  partition by id
  order by EFFDAT desc
  range between unbounded preceding and current row
)

If you think about how the data looks if you order it in the same way, by descending date:

select partitionbytest.*,
  count(*) over (partition by id order by effdat desc) range_rows,
  max(seq) over (partition by id order by effdat desc) range_max_seq,
  count(*) over (partition by id) id_rows,
  max(seq) over (partition by id) id_max_seq
from partitionbytest
order by effdat desc;

        ID        SEQ EFFDAT     RANGE_ROWS RANGE_MAX_SEQ    ID_ROWS ID_MAX_SEQ
---------- ---------- ---------- ---------- ------------- ---------- ----------
      1024          2 2017-08-01          1             2          3          3
      1024          3 2017-04-18          2             3          3          3
      1024          1 2010-07-01          3             3          3          3

then it becomes a bit clearer. I've included equivalent analytic counts so you can also see how many rows are being considered, with and without the order by clause.

  • For the first row the max seq value is found from looking at that current row's data and all preceding rows with later dates (as it's descending), and there are none of those, so it is the value from that row itself - so it's 2. The rows following it it, with seq values 3 and 1, are not considered.

  • For the second row it looks at the current row and all preceding rows with later dates, so it can consider both the preceding value of 2 and the current value of 3. Since 3 is highest among those, it shows that. The row following it it, with seq value 1, is not considered.

  • For the third row it looks at the current row and all preceding rows with later dates, so it can consider the preceding values of 2 and 3 and the current value of 1. Since 3 is still highest it shows that again.

Without the order by clause it always considers all values for that ID, so it sees 3 as the highest for all of them.

See the documentation for analytic functions for more details of how this is determined, partitularly:

The group of rows is called a window and is defined by the analytic_clause. For each row, a sliding window of rows is defined. The window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time.

and

You cannot specify [windowing_clause] unless you have specified the order_by_clause.

and

If you omit the windowing_clause entirely, then the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Upvotes: 7

Andrew
Andrew

Reputation: 27294

This is correct, although it seems very strange.

The order by clause which is permitted on the MAX, is a window function that allow for the order function to also contain a windowing clause - so by specifying an order by clause you then pick up what the default behaviour of the windowing clause would be (since you did not specify it).

The default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Docs : https://docs.oracle.com/database/121/SQLRF/functions004.htm#SQLRF06174

If you omit the windowing_clause entirely, then the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Upvotes: 6

Related Questions