Chattz


How to delete faster?

I have a DB table which consists of 2.5 billion records. There are duplicates to the tune of 11 million. Which is the fastest way to delete these 11 million records?

Upvotes: 6

Views: 3724

Answers (5)

Steve Broberg

Reputation: 4394

Whether it is faster to delete the existing rows or to create a new table and drop the old one depends on a lot of factors. 11 million rows is a lot, but it's only about 0.5% of the total number of rows in the table. It's quite possible that the recreate & drop could be much slower than the delete, depending on how many indexes exist on the source table, as well as where the rows that need deleting sit on the data pages.

Then there's the issue of whether the source table is live or not. If there are inserts & updates going on while this cleanup is occurring, the copy & drop isn't going to work without a fair amount of extra code to sync the table up after the fact.

Finally, why is it necessary that this operation be "fast"? Is it because the system needs to be offline while the process is occurring? You could write a procedure that removes the dupes while the system is live, but doesn't impact the rest of the system in terms of consuming undo. We have solved this problem in the past by first writing a query that collects the primary keys of the rows to be removed in a second table, like so:

  INSERT
    INTO RowsToDeleteTable
  SELECT PKColumn
    FROM SourceTable
   WHERE <conditions used to find rows to remove>

CREATE UNIQUE INDEX PK_RowsToDelete ON RowsToDeleteTable (PKColumn);
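For illustration only: if the duplicates were defined by a single column (dup_col below is just a placeholder name), the collecting query might look something like this, keeping the row with the lowest PKColumn in each group:

  INSERT
    INTO RowsToDeleteTable
  SELECT PKColumn
    FROM (SELECT PKColumn,
                 ROW_NUMBER() OVER (PARTITION BY dup_col ORDER BY PKColumn) rn
            FROM SourceTable)
   WHERE rn > 1;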

Then we have a PL/SQL block that either loops over the rows in a cursor like so:

BEGIN
  FOR theRow IN (SELECT PKColumn FROM RowsToDeleteTable ORDER BY 1) LOOP
    DELETE FROM SourceTable WHERE PKColumn = theRow.PKColumn;
    -- optionally wait a bit here to throttle the loop
    COMMIT;
  END LOOP;
END;

or does something like this:

DECLARE
  thePK RowsToDeleteTable.PKColumn%TYPE;
BEGIN
  LOOP
    SELECT MIN(PKColumn) INTO thePK FROM RowsToDeleteTable;
    EXIT WHEN thePK IS NULL;
    DELETE FROM SourceTable WHERE PKColumn = thePK;
    -- optionally wait a bit here to throttle the loop
    DELETE FROM RowsToDeleteTable
     WHERE PKColumn = thePK;
    COMMIT;
  END LOOP;
END;

The looping and "SELECT MIN" approach is obviously less efficient, but it has the advantage of allowing you to follow the progress of the delete operation, since RowsToDeleteTable shrinks as rows are processed. We put a bit of wait code in the loop to allow us to control how vigorously the reaping operation occurs.
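For the wait itself, one option is a short per-iteration sleep; this assumes the schema has execute rights on DBMS_LOCK (newer Oracle releases also offer DBMS_SESSION.SLEEP):

  -- sleep 0.1 seconds between deletes; the duration is purely illustrative
  DBMS_LOCK.SLEEP(0.1);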

The initial creation of the RowsToDeleteTable goes very quickly, and you have the advantage of allowing the process to take as long as you want. In a case like this, the "holes" left in your extents by the deletes won't be too bad, since you're deleting such a small percentage of the total data.

Upvotes: 2

Charles Bretana

Reputation: 146409

First put an index on the column or columns that define and contain the duplicate values,

Then, assuming the table has a primary key (PK),

  Delete From Table T
  Where PK <>
        (Select Min(PK) From Table
          Where ColA = T.ColA
            And ColB = T.ColB
            -- ...and so on, one predicate per column in the set defined above
        )

NOTE: you could also use Max(PK); all you're doing is identifying a single record not to delete from each set of duplicates.

EDIT: To eliminate the extensive use of the transaction log and the undo tablespace, you could store the values that have dupes in a temp table, and then delete the dupes for each value within its own small transaction...

Assuming only one column (call it ColA, a number) defines the dupes...

   Create Table Dupes (ColA Number);

   Insert Into Dupes (ColA)
   Select ColA
     From Table                -- "Table" stands for the source table
    Group By ColA
   Having Count(*) > 1;

   Declare
      recordExists Number := 0;
      ColAValue    Number;
   Begin
      Select Case When Exists (Select 1 From Dupes)
             Then 1 Else 0 End
        Into recordExists From Dual;

      While recordExists = 1
      Loop
         Select Max(ColA) Into ColAValue From Dupes;

         -- remove all but one row for this duplicate value, then clear it
         -- from Dupes, all within one small transaction
         Delete From Table T
          Where ColA = ColAValue
            And PK <> (Select Min(PK) From Table
                        Where ColA = ColAValue);

         Delete From Dupes Where ColA = ColAValue;

         Commit;

         Select Case When Exists (Select 1 From Dupes)
                Then 1 Else 0 End
           Into recordExists From Dual;
      End Loop;
   End;

Not tested, so the syntax may need massaging...

Upvotes: 1

skaffman

Reputation: 403441

Deleting one duplicate from many is a tricky business, and with that many records, you have a problem.

One option is to turn the problem on its head, and copy the records you want to keep into a new table. You can use CREATE TABLE ... NOLOGGING AS SELECT DISTINCT ... syntax, which copies your de-duplicated records while generating minimal redo, which is much faster. Once your new table is populated, drop or rename the old one, and rename the new one into place.

See http://www.databasejournal.com/features/oracle/article.php/3631361/Managing-Tables-Logging-versus-Nologging.htm

Oh, and remember to slap a UNIQUE index on the new table so this doesn't happen again.
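A rough sketch of that copy-and-switch, where mytable, mytable_dedup and dup_field are placeholder names (grants, triggers and other dependent objects would still need handling):

-- Copy only the de-duplicated rows; NOLOGGING minimises redo generation
CREATE TABLE mytable_dedup NOLOGGING AS
SELECT DISTINCT * FROM mytable;

-- Swap the tables once the copy has been verified
RENAME mytable TO mytable_old;
RENAME mytable_dedup TO mytable;

-- Keep the duplicates from coming back
CREATE UNIQUE INDEX mytable_uk ON mytable (dup_field);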

The moral of the story is: never use DELETE to get rid of large numbers of records; it's horrifyingly slow, because every deleted record has to be recorded in the undo and redo logs. Either copy-and-switch, or TRUNCATE.

Upvotes: 19

Quassnoi

Reputation: 425251

DELETE
FROM    mytable
WHERE   rowid IN
        (
        SELECT  rid
        FROM    (
                SELECT  rowid rid,
                        ROW_NUMBER() OVER (PARTITION BY dup_field ORDER BY dup_field) rn
                FROM    mytable
                )
        WHERE   rn > 1
        )

or maybe even this:

DELETE
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable mi
        WHERE   mi.dup_field = mo.dup_field
                AND mi.rowid > mo.rowid
        )

Both of these queries will use a quite efficient HASH SEMI JOIN; the latter will be faster if there is no index on dup_field. (The rowid comparison in the second query is what keeps exactly one row from each group of duplicates.)

You may be tempted to copy the rows, but note that much more REDO and UNDO information will be generated when copying 2G rows than when deleting 11M.

Upvotes: 3

Cătălin Pitiș

Reputation: 14341

If you're sure that you won't break the integrity of the data (referential integrity), disable the constraints (and, where possible, the indexes), perform the delete, then re-enable everything. You have to try it first, to see whether rebuilding the indexes when re-enabling them is less time consuming than doing the delete with them in place.
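As a sketch only, with hypothetical object names (mytable, mytable_fk, mytable_dup_idx), that sequence might look like this in Oracle:

-- All object names below are hypothetical.
-- Disable referential constraints and skip index maintenance during the delete.
ALTER TABLE mytable DISABLE CONSTRAINT mytable_fk;
ALTER INDEX mytable_dup_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;

-- ... run the duplicate-removing DELETE here ...

-- Rebuild and re-enable afterwards.
ALTER INDEX mytable_dup_idx REBUILD NOLOGGING;
ALTER TABLE mytable ENABLE CONSTRAINT mytable_fk;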

Some query optimization might also help, but without knowing more details, this is only a theoretical discussion.

Upvotes: 0
