Reputation: 159
I have a table with over 30 million records. When doing insert, I need to avoid the Unique constraint violation.
When I use this NOT EXIST approach, the insert takes forever. In fact, it couldn't finish after 24 hours of running. And I can't use the ignore_row_on_dupkey_index
hint, because this table has more than 1 PK columns.
Another option is to insert in subsets. But I want to know if there's any other way before I do sub-setting.
insert into tlb1 a
select * from tlb2 b
where not exists (select 'x' from tlb1 c
where b.pk = c.pk)
Upvotes: 2
Views: 452
Reputation: 21075
The important decision depends on the numbe rof row inserted, i.e. the number of the rows in the table TBL2
If this number is rather low (say in hundreds to thousands) you may use safely your approach, provided there is an index on the PK
column(s) - whoch should be to enforce the unique constraint.
Please check that the used execution plan is something like the one below
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 110 | 2860 | 113 (0)| 00:00:02 |
| 1 | LOAD TABLE CONVENTIONAL | TBL1 | | | | |
| 2 | NESTED LOOPS ANTI | | 110 | 2860 | 113 (0)| 00:00:02 |
| 3 | TABLE ACCESS FULL | TBL2 | 110 | 1430 | 3 (0)| 00:00:01 |
|* 4 | INDEX UNIQUE SCAN | TBL1_IXD | 1 | 13 | 1 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("B"."PK"="C"."PK")
The NESTED LOOPS ANTI
means that for each inserted row a single index lookup will be done to check if the key already exists in the target table.
This will work fine for a low number of inserted rows. For a large insert (millions rows) the optimizer will switch to a HASH JOIN RIGHT ANTI
, i.e. all rows from both table will be joined to get th epossible duplicates.
This can take some time (but usually not 24 hours) and the approach with DML Error Logging which eliminates the need of the join.
INSERT INTO tbl1 (pk)
SELECT pk
FROM tbl3
LOG ERRORS INTO err$_tbl1 ('dedup tbl3') REJECT LIMIT UNLIMITED;
This approach will scale well especially when the number of the duplicates is low compared with the number of inserted rows. It is comparable to a normal insert:
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 876K| 10M| 427 (1)| 00:00:06 |
| 1 | LOAD TABLE CONVENTIONAL | TBL1 | | | | |
| 2 | TABLE ACCESS FULL | TBL3 | 876K| 10M| 427 (1)| 00:00:06 |
---------------------------------------------------------------------------------
Upvotes: 1