Reputation: 109
There's a common situation where we need to look in a single table and, for each key, get the record where another field has the highest value, latest date, or whatever. We can do this either with a subquery selecting the key and max value, then joining that back to the original table, or with an OLAP (analytic) function (RANK() or ROW_NUMBER() as appropriate).
I'm tuning a query which does this over a large dataset (40M records). Oracle's EXPLAIN PLAN doesn't always reflect true performance, so I'm hesitant to rely on it.
A simplistic example is below. I want to get the order number and miscellaneous other fields for every record where the order date matches the greatest one for that order.
-- Self-join version:
with ords as (
select 'AAA1' as ord_no, to_date('2023-01-01','yyyy-mm-dd') as ord_date, 'A' field1, 'B' field2 from dual union all
select 'AAA1' as ord_no, to_date('2023-02-01','yyyy-mm-dd') as ord_date, 'C' field1, 'D' field2 from dual union all
select 'AAA1' as ord_no, to_date('2023-03-01','yyyy-mm-dd') as ord_date, 'E' field1, 'F' field2 from dual union all
select 'AAA1' as ord_no, to_date('2023-03-01','yyyy-mm-dd') as ord_date, 'E1' field1, 'F1' field2 from dual union all
select 'BBB1' as ord_no, to_date('2023-01-01','yyyy-mm-dd') as ord_date, 'G' field1, 'H' field2 from dual union all
select 'BBB1' as ord_no, to_date('2023-02-01','yyyy-mm-dd') as ord_date, 'I' field1, 'J' field2 from dual union all
select 'BBB1' as ord_no, to_date('2023-03-01','yyyy-mm-dd') as ord_date, 'K' field1, 'L' field2 from dual
),
max_ord as (
select ord_no, max(ord_date) max_ord_date from ords group by ord_no
)
select mo.ord_no, mo.max_ord_date, o.field1, o.field2
from max_ord mo
join ords o on mo.ord_no = o.ord_no and mo.max_ord_date = o.ord_date
order by mo.ord_no;
-- OLAP version
-- same data setup as above, omitted for brevity
select ord_no, ord_date, field1, field2 from
(select ord_no, ord_date, field1, field2, rank() over (partition by ord_no order by ord_date desc) maxdt
from ords
)
where maxdt = 1 order by ord_no;
In both cases, the expected result would be something like this:
"ORD_NO" "ORD_DATE" "FIELD1" "FIELD2"
"AAA1" 3/1/2023 "E" "F"
"AAA1" 3/1/2023 "E1" "F1"
"BBB1" 3/1/2023 "K" "L"
Obviously, either approach works; I'm just having trouble finding any good references as to which (in general) would be more efficient. We're on Oracle; obviously other databases may perform quite differently.
The OLAP version seems, to me, a little clearer to read: it makes it more obvious what you're doing.
Any guidance or personal experiences would be welcome!
Upvotes: 2
Views: 100
Reputation: 11603
Which is better entirely depends on how many historical versions you want to skip.
If in your example most ORD_NO values have hundreds or thousands of ORD_DATEs, and particularly if the table is very wide (many bytes per row), then wanting only the last one is going to move you toward a subquery+index method so you don't have to scan the entire table: you're after only a tiny percentage of the table. In this case, an index on both columns (ORD_NO, ORD_DATE) could satisfy a GROUP BY query with a parallel index fast full scan and then nested loops to the table. Overhinting it primarily just to communicate the intended plan:
SELECT /*+ LEADING(x) USE_NL_WITH_INDEX(o) */
o.*
FROM (SELECT /*+ NO_MERGE INDEX_FFS(o) PARALLEL(8) */
ord_no,
MAX(ord_date) ord_date
FROM ordertable o -- will use only the index for the scan
GROUP BY ord_no) x,
ordertable o -- will index seek after only your target rows
WHERE x.ord_no = o.ord_no
AND x.ord_date = o.ord_date
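A minimal sketch of the composite index that plan assumes (the index name is just illustrative):
-- Hypothetical composite index; covers the GROUP BY so the aggregate
-- can be satisfied by an index fast full scan without touching the table
CREATE INDEX ordertable_no_date_ix ON ordertable (ord_no, ord_date);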
However, if there are only a small number of historical versions (fewer than a hundred or so ORD_DATE values per ORD_NO), or the table is very narrow (the index wouldn't be significantly smaller), then index access will hurt you: you'll find too many index entries requiring a table visit, each of which is an inefficient entire-block read just for one row, and the inefficiency adds up quickly at high volumes. You want a full table scan in this case, which makes the windowing option attractive because it avoids a second scan and doesn't need the index. You already have that SQL written properly (your second example), though you should/could hint it for parallelism. Windowing functions particularly benefit from parallelism: not only is the work spread across multiple slaves, but each slave can allocate up to the maximum allowable per-process PGA memory of 1G, which in aggregate is far more memory than a serialized query can use, reducing the need for slow(er) I/O to temp space for the sort workarea those functions require.
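A sketch of that hinted version, assuming the same ordertable as above (the degree of 8 is just illustrative):
SELECT /*+ PARALLEL(8) */
       ord_no,
       ord_date,
       field1,
       field2
FROM (SELECT ord_no,
             ord_date,
             field1,
             field2,
             RANK() OVER (PARTITION BY ord_no ORDER BY ord_date DESC) maxdt
      FROM ordertable) -- single full scan; no second visit to the table
WHERE maxdt = 1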
Because it's a matter of threshold, there are no hard and fast rules, only general guidelines. Ultimately a question like this is answered only by actually trying it both ways and seeing which one works best in your environment with your data. There's no "fastest way" to write SQL: what is bad in one situation is great in another. It's all about the relative number of rows per step and the resources available to the database in the access, join and sort methodologies it has to choose from to process those rows at each step. That means it's all about your data volume, cardinalities, join relationships, indexing, partitioning, compression, query predicates, caching, DB configuration, hardware, etc. Eventually you get to where you can predict how something might act, but you never really know until you actually try it in your environment against your data.
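One way to run that comparison is to execute each variant with the GATHER_PLAN_STATISTICS hint and then pull the actual (not estimated) row counts and timings from the cursor cache; a sketch:
-- Run the candidate query with actuals collection enabled.
-- The COUNT(*) wrapper avoids client fetch overhead, though note it
-- may change column projection versus the real query.
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*)
FROM (SELECT ord_no,
             ord_date,
             RANK() OVER (PARTITION BY ord_no ORDER BY ord_date DESC) maxdt
      FROM ordertable)
WHERE maxdt = 1;

-- Then compare estimated vs. actual rows and time per plan step:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));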
Upvotes: 1