Reputation: 2720
Hi I have a stored procedure that is used to fetch records while searching. This procedure returns millions of records. However there was a bug found inside the search procedure which also return duplicate records in some scenario when certain condition are met. I have found the error why it was returning duplicate records: Below is the query that is in question:
With cteAutoApprove (AcctID, AutoApproved,DecisionDate)
AS (
select
A.AcctID,
CAST(autoEnter AS SMALLINT) AS AutoApproved,
DecisionDate
from
(
SELECT
awt.AcctID,
MIN(awt.dtEnter) AS DecisionDate
FROM
dbo.AccountWorkflowTask awt
JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
Join Task T on T.TaskID = wt.TaskID
WHERE
(
(T.TaskStageID = 3 and awt.ReasonIDExit is NULL)
OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
)
GROUP BY
awt.AcctID
) A
Join AccountWorkflowTask awt1
on awt1.dtEnter=A.DecisionDate and awt1.AcctID=a.AcctID
),
This CTE was returning duplicate record because of the condition on awt1.dtEnter=A.DecisionDate the dtEnter for some account was exactly same. This is the reason it returned duplicate record.
My question is what should I use to prevent this. I cannot use Distinct here as it will definitely slow down the search procedure. Shall I use Rank or Dense Rank so that it is optimized and the query takes less time to execute the result? Or some other technique? Please help as I am actually stuck here
Upvotes: 0
Views: 57
Reputation: 12534
It does seem like a good candidate for row_number (not rank, with the same dates on the same acctid, you'd still have multiple records) Obviously I can't test the query here, but winging it:
select
A.AcctID,
CAST(autoEnter AS SMALLINT) AS AutoApproved,
DecisionDate
from
(
SELECT
awt.AcctID,
awt.dtEnter AS DecisionDate,
autoEnter,
row_number() over (partition by awt.acctid order by awt.dtEnter) rnr
FROM
dbo.AccountWorkflowTask awt
JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
Join Task T on T.TaskID = wt.TaskID
WHERE
(
(T.TaskStageID = 3 and awt.ReasonIDExit is NULL)
OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
)
) A
where rnr = 1
This way, the group by is no longer necessary: getting the first date is done by row_number. Neither is the second join, the subquery already contains all the data (and the optimizer is smart enough not to do anything with the rows it doesn't need)
PS. because sql server window functions work incredibly efficient, using row_number instead of the min() - join construction, will most likely gain a performance boost, even if there were no double rows.
Upvotes: 1