Efficient Way to Self Join on Table for Various Criteria

Question

Background

I'm working to find potentially duplicated payments / records in an AP system by self-joining on a table containing AP records. Generally, I have a pattern that works pretty well for decently large datasets. The issue arises when my matching criteria broadens and the number of matches per record gets too high (sometimes 20-100 matches per record). The resulting dataset can be massive and attempting to get distinct values takes too long.

Current Solution

Query that runs fine when attempting to find matches on vendor, invoice number, date, and invoice amount:

SELECT DISTINCT DENSE_RANK()  OVER(ORDER BY ap.GtInvoiceNumber,     
ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap 
INNER JOIN ApEntries sap ON 
ap.GtInvoiceNumber = sap.GtInvoiceNumber 
AND ap.InvoiceDate = sap.InvoiceDate 
AND sap.InvoiceGrossAmount = ap.InvoiceGrossAmount 
AND ap.VendorNumber =  sap.VendorNumber
AND ap.Id <> sap.Id 
WHERE ap.ClientId = @ClientId

This query runs in a matter of seconds because the number of rows that meet the criteria are very low and thus performing the distinct is simple.

Query that has issues when attempting to find matches on vendor, date, and invoice amount (excluding invoice number from the matching criteria):

SELECT DISTINCT DENSE_RANK()  OVER(ORDER BY ap.VendorNumber, ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap 
INNER JOIN ApEntries sap ON 
ap.VendorNumber = sap.VendorNumber 
AND ap.InvoiceDate = sap.InvoiceDate 
AND ap.InvoiceGrossAmount =  sap.InvoiceGrossAmount
AND (ap.Id <> sap.Id AND ap.TransactionId <> sap.TransactionId)
WHERE ap.ClientId = @ClientId

This query takes on the order of 3-4 minutes to run because the criteria is too broad. Is there a better solution for this in SQL? I recognize that each match will be going both ways and I'll get a record back for each match, but also records for each match within the matched set, so a group of 5 will result in 30 rows, a group of 20 will result in 420 rows, and so on. How can I make this more efficient? I'd like to avoid using cursors, but I recognize that might be my only option.

Thanks for taking the time to read this.

Additional Info

Table is approximately 760k records (all of which are queried in these queried).

Query Stats for Bad Query

As you can see, I'm returning 13,824,170 rows from a table with 760k records. This due to the fact some groups are large (252 members for example will result in 63,756 records).

Relevant Indexes:

CREATE NONCLUSTERED INDEX [IX_ApEntries_DistinctIndex] ON [dbo].[ApEntries]
(
    [ClientId] ASC
)
INCLUDE (   [Id],
    [VendorNumber],
    [InvoiceNumber],
    [GtInvoiceNumber],
    [InvoiceDate],
    [PurchaseOrderNumber],
    [InvoiceGrossAmount],
    [InvoiceNetAmount],
    [CheckDate],
    [CheckNumber],
    [TransactionId]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE NONCLUSTERED INDEX [IX_ApEntries_DupRep7] ON [dbo].[ApEntries]
(
    [VendorNumber] ASC,
    [InvoiceDate] ASC,
    [InvoiceGrossAmount] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE UNIQUE NONCLUSTERED INDEX [IX_ApEntries_Xid] ON [dbo].[ApEntries]
(
    [ClientId] ASC,
    [TransactionId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

Juan Carlos Oropeza · Accepted Answer

Why not try with ROW_NUMBER()

SELECT *
FROM (
        SELECT ap.Id EntryId
             , ap.TransactionId
             , ap.VendorNumber
             , ap.InvoiceNumber
             , ap.GtInvoiceNumber
             , ap.InvoiceDate
             , ap.PurchaseOrderNumber
             , ap.InvoiceGrossAmount
             , ap.InvoiceNetAmount
             , ap.CheckNumber
             , ap.CheckDate
             , row_number() over 
               (partition by VendorNumber, InvoiceDate, InvoiceGrossAmount  
                order by TransactionId) rn
    WHERE ap.ClientId = @ClientId
   ) T
 WHERE rn > 1;

Efficient Way to Self Join on Table for Various Criteria

Background

Current Solution

Additional Info

Answers (1)

Related Questions