Reputation: 23
Was wondering if anyone knew an efficient way to dedupe records in a large dataset using Oracle SQL based on the max values of 2 attributes in conjunction.
In the hypothetical example below, I am looking remove all duplicate COMPANYID / CHILD ID Pairs by selecting first the maximum transactionid. Where the payload ID still has duplicates, the maximum BATCHID.
note: transactionID and batchID may have null values (which would be expected to the lowest value)
Table: Transaction
<p> CompanyID| ChildID | transactionid| BatchID | Product Details </P>
<p> ABC EFG 306 Product1 </p>
<p>ABC EFG 306 54 Product2</p>
<p>ZXY BFG 405 003 Product1</p>
<p>ZXY BFG 405 004 Product2</p>
<p>ZXY BFG 407 Product3</p>
Expected Result:
<p>ABC | EFG | 306 | 54 | Product 2 --selected on basis of highest transactionid and batchid </P>
<p>ZXY | BFG | 405 | 407 | Product 3 --selected on basis of highest transactionid </p>
I envisioned simply: 1) Using a max function on the transactionid and subquerying the result to max the batchID in addition 2) Self joining the "de-duped' set to the original set to obtain product information
Does anybody know of a more efficient / cleaner way to achieve this and a way to handle the nulls better?
Appreciate any feedback.
Upvotes: 0
Views: 1205
Reputation: 14209
From Oracle 11g, you can use this kind of requests:
with w(CompanyID, ChildID, transactionid, BatchID, Product_Details) as
(
select 'ABC', 'EFG', 306, null, 'Product1 ' from dual
union all
select 'ABC', 'EFG', 306, 54, 'Product2' from dual
union all
select 'ZXY', 'BFG', 405, 003, 'Product1' from dual
union all
select 'ZXY', 'BFG', 405, 004, 'Product2' from dual
union all
select 'ZXY', 'BFG', 407, null, 'Product3' from dual
)
select w.CompanyID,
w.ChildID,
max(w.transactionid) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_transactionid,
max(w.batchid) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_batchid,
max(w.Product_Details) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_Product_Details
from w
group by w.CompanyID, w.ChildID
;
The nvl
function allows you to handle null cases. Here is the output (which does not fit yours, but I did the request as I understood what you wanted):
COMPANYID CHILDID MAX_TRANSACTIONID MAX_BATCHID MAX_PRODUCT_DETAILS
ABC EFG 306 54 Product2
ZXY BFG 407 Product3
EDIT: Let me try to explain further DENSE_RANK
and LAST
: inside a GROUP BY
, this syntax appears as an aggregate function (like SUM, AVG...).
ORDER BY
gives the sorting (here, transactionid and batchid)DENSE_RANK LAST
states that you will focus on the last ranked row(s) of this sorting (you can have indeed several rows with same rank)MAX
takes the maximum value inside these top-ranked rows. Most of the time, you only have one row so MAX
can appear useless, but it is not. So you will often see MIN
and DENSE_RANK FIRST
, or MAX
and DENSE_RANK LAST
.Here is the Oracle doc on this subject.
Upvotes: 2
Reputation: 1270021
Because you are dealing with multiple columns, you should also consider just using row_number()
:
select t.*
from (select t.*,
row_number() over (partition by CompanyId, ChildId
order by transactionid desc nulls last, BatchID desc nulls last
) as seqnum
from t
) t
where seqnum = 1;
The keep/dense_rank
method is fast. I'm not sure if doing it multiple times is faster than using row_number()
. Testing can give you this information.
Upvotes: 1