sheamus
sheamus

Reputation: 3153

Avoid duplicate keys on INSERT INTO/SELECT

I am trying to do an Insert/Select but I am getting a duplicate key error.

INSERT INTO dbo.DESTINATION_TABLE
                (
                    DocNumber,
                    TenantId,
                    UserId,
                    OtherField
                )
                SELECT
                    po.DocNumber,
                    po.TenantId,
                    po.CreatedById,
                    po.OtherField
                FROM dbo.SOURCE_TABLE po
                WHERE
                    po.DeletedById IS NULL AND
                    NOT EXISTS(
                        SELECT * FROM dbo.DESTINATION_TABLE poa 
                        WHERE
                           poa.DocNumber = po.DocNumber AND
                           poa.TenantId = po.TenantId
                    ) 

DESTINATION_TABLE has a composite primary key of DocNumber and TenantId. DESTINATION_TABLE is empty at the time of running.

SOURCE_TABLE has a primary key of 'SourceTableId'.

But I keep getting an error

Violation of PRIMARY KEY constraint 'PK_dbo.DESTINATION_TABLE'. Cannot insert duplicate key in object 'dbo.DESTINATION_TABLE'. The duplicate key value is (DOC-99, some-tenant).

I have also tried

MERGE INTO DESTINATION_TABLE poa
USING (
    SELECT
        x.DocNumber,
        x.TenantId,
        x.CreatedById,
        x.OtherField
    FROM dbo.SOURCE_TABLE x
    WHERE x.DeletedById IS NULL
) po
ON (poa.TenantId = po.TenantId AND poa.DocNumber = po.DocNumber)
WHEN NOT MATCHED THEN INSERT (DocNumber, TenantId, UserId, OtherField)
VALUES(po.DocNumber, po.TenantId, po.CreatedById, po.OtherField);

But I get the exact same result.

How is this happening? Is it because it is checking for 'NOT EXISTS' before running the insert? How do I fix this?

Upvotes: 0

Views: 987

Answers (2)

Thom A
Thom A

Reputation: 95544

It seems that the flaw here is your understanding of how SQL works. SQL is a set-based language, so it works with the data in sets. For the above, this means that you define the data you want to INSERT with your SELECT and then all the rows that you define are INSERTed. For your EXISTS this means that it only checks against rows that exist in the table prior to any of the rows being inserted, it does not insert each row one at a time and validate the EXISTS prior to each row.

Let's take a very basic example with a single column table:

CREATE TABLE dbo.SomeTable (ID int CONSTRAINT PK_SomeTable PRIMARY KEY);
INSERT INTO dbo.SomeTable (ID)
VALUES(1),(3);

Now, let's say we want to insert the following dataset:

ID
1
2
2
4

If you performed the following query, this will generate the same error you had:

INSERT INTO dbo.SomeTable (ID)
SELECT V.ID
FROM (VALUES(1),(2),(2),(4))V(ID)
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.SomeTable ST
                  WHERE ST.ID = V.ID);

Violation of PRIMARY KEY constraint 'PK_SomeTable'. Cannot insert duplicate key in object 'dbo.SomeTable'. The duplicate key value is (2).

This is because there are two rows with the value 2 for ID that are trying to be INSERTed. Both rows are trying to be INSERTed because there are no rows in the table with the value 2 at the time the INSERT occurs. If you want, you can validate what rows were tried to be INSERTed by commenting out the INSERT clause:

--INSERT INTO dbo.SomeTable (ID)
SELECT V.ID
FROM (VALUES(1),(2),(2),(4))V(ID)
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.SomeTable ST
                  WHERE ST.ID = V.ID);
ID
2
2
4

For your scenario, one method would be to use a CTE with ROW_NUMBER to limit the results to a single row for a single primary key value.

WITH CTE AS
    (SELECT po.DocNumber,
            po.TenantId,
            po.CreatedById,
            po.OtherField,
            ROW_NUMBER() OVER (PARTITION BY po.DocNumber, po.TenantId ORDER BY <Column(s) to ORDER BY to determine desired row>) AS RN
     FROM dbo.SOURCE_TABLE po
     WHERE po.DeletedById IS NULL
       AND NOT EXISTS (SELECT *
                       FROM dbo.DESTINATION_TABLE poa
                       WHERE poa.DocNumber = po.DocNumber
                         AND poa.TenantId = po.TenantId))
INSERT INTO dbo.DESTINATION_TABLE (DocNumber,
                                   TenantId,
                                   UserId,
                                   OtherField)
SELECT DocNumber,
       TenantId,
       CreatedById,
       OtherField
FROM CTE
WHERE RN = 1;

Upvotes: 2

Hogan
Hogan

Reputation: 70513

Run the following query

SELECT po.DocNumber, po.TenantId, COUNT(*) as CNT
FROM dbo.SOURCE_TABLE po
GROUP BY po.DocNumber, po.TenantId
HAVING COUNT(*) > 1

It will show you the rows that are having the issue -- there is more than one of them in the source table.

You then need to figure out what business rule you want to use for the values of items with duplicate rows (x.CreatedById, x.OtherField), maybe you want the first one the max one etc. Then re-write your select query with a group by to only retrieve one row per key.

Upvotes: 0

Related Questions