Reputation: 1596
I'm working to find potentially duplicated payments / records in an AP system by self-joining on a table containing AP records. Generally, I have a pattern that works pretty well for decently large datasets. The issue arises when my matching criteria broadens and the number of matches per record gets too high (sometimes 20-100 matches per record). The resulting dataset can be massive and attempting to get distinct
values takes too long.
Query that runs fine when attempting to find matches on vendor, invoice number, date, and invoice amount:
SELECT DISTINCT DENSE_RANK() OVER(ORDER BY ap.GtInvoiceNumber,
ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap
INNER JOIN ApEntries sap ON
ap.GtInvoiceNumber = sap.GtInvoiceNumber
AND ap.InvoiceDate = sap.InvoiceDate
AND sap.InvoiceGrossAmount = ap.InvoiceGrossAmount
AND ap.VendorNumber = sap.VendorNumber
AND ap.Id <> sap.Id
WHERE ap.ClientId = @ClientId
This query runs in a matter of seconds because the number of rows that meet the criteria are very low and thus performing the distinct
is simple.
Query that has issues when attempting to find matches on vendor, date, and invoice amount (excluding invoice number from the matching criteria):
SELECT DISTINCT DENSE_RANK() OVER(ORDER BY ap.VendorNumber, ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap
INNER JOIN ApEntries sap ON
ap.VendorNumber = sap.VendorNumber
AND ap.InvoiceDate = sap.InvoiceDate
AND ap.InvoiceGrossAmount = sap.InvoiceGrossAmount
AND (ap.Id <> sap.Id AND ap.TransactionId <> sap.TransactionId)
WHERE ap.ClientId = @ClientId
This query takes on the order of 3-4 minutes to run because the criteria is too broad. Is there a better solution for this in SQL? I recognize that each match will be going both ways and I'll get a record back for each match, but also records for each match within the matched set, so a group of 5 will result in 30 rows, a group of 20 will result in 420 rows, and so on. How can I make this more efficient? I'd like to avoid using cursors, but I recognize that might be my only option.
Thanks for taking the time to read this.
Table is approximately 760k records (all of which are queried in these queried).
As you can see, I'm returning 13,824,170 rows from a table with 760k records. This due to the fact some groups are large (252 members for example will result in 63,756 records).
Relevant Indexes:
CREATE NONCLUSTERED INDEX [IX_ApEntries_DistinctIndex] ON [dbo].[ApEntries]
(
[ClientId] ASC
)
INCLUDE ( [Id],
[VendorNumber],
[InvoiceNumber],
[GtInvoiceNumber],
[InvoiceDate],
[PurchaseOrderNumber],
[InvoiceGrossAmount],
[InvoiceNetAmount],
[CheckDate],
[CheckNumber],
[TransactionId]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
CREATE NONCLUSTERED INDEX [IX_ApEntries_DupRep7] ON [dbo].[ApEntries]
(
[VendorNumber] ASC,
[InvoiceDate] ASC,
[InvoiceGrossAmount] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
CREATE UNIQUE NONCLUSTERED INDEX [IX_ApEntries_Xid] ON [dbo].[ApEntries]
(
[ClientId] ASC,
[TransactionId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
Upvotes: 2
Views: 333
Reputation: 48187
Why not try with ROW_NUMBER()
SELECT *
FROM (
SELECT ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
, row_number() over
(partition by VendorNumber, InvoiceDate, InvoiceGrossAmount
order by TransactionId) rn
WHERE ap.ClientId = @ClientId
) T
WHERE rn > 1;
Upvotes: 1