Ron Skufca

Reputation: 2078

Deleting data from a large table

I have a table with about 10 fields that stores GPS info for customers. Over time, as we have added more customers, that table has grown to about 14 million rows. As the GPS data comes in, a service constantly inserts rows into the table. 90% of the data is not relevant, i.e. the customer does not care where the vehicle was 3 months ago, but the most recent data is used to generate tracking reports. My goal is to write a SQL script to purge the data that is older than a month.

Here is my problem: I cannot use TRUNCATE TABLE, as I would lose everything. Yesterday I wrote a DELETE statement with a WHERE clause. When I ran it on a test system it locked up my table, and the simulated GPS inserts were intermittently failing. Also, my transaction log grew to over 6 GB as it attempted to log each delete.

My first thought was to delete the data a little at a time, starting with the oldest first, but I was wondering if there is a better way.

Upvotes: 6

Views: 4886

Answers (16)

Jim Neilson

Reputation: 1

I'm sharing my solution. I did index the date field. While the procedure was running, I tested record counts, inserts, and updates, and they were all able to complete. On an Azure managed instance running the absolute lowest configuration (General Purpose, 4 cores), I was able to purge 1 million rows in about a minute (roughly 55 seconds).

CREATE PROCEDURE [dbo].[PurgeRecords] (
 @iPurgeDays INT = 2,
 @iDeleteRows INT = 1000,
 @bDebug BIT = 1 --defaults to debug mode
)
AS

SET NOCOUNT ON
DECLARE @iRecCount INT = 0 
DECLARE @iCycles INT = 0
DECLARE @iRowCount INT = 1
DECLARE @dtPurgeDate DATETIME = GETDATE() - @iPurgeDays
SELECT @iRecCount = COUNT(1) FROM YOURTABLE WHERE [Created] <= @dtPurgeDate
SELECT @iCycles = @iRecCount / @iDeleteRows
SET @iCycles = @iCycles + 1  --add one more cycle to pick up the remainder
--purge the rows in groups
WHILE @iRowCount <= @iCycles
 BEGIN
  BEGIN TRY
   IF @bDebug = 0
    BEGIN
     --delete a group of records
     DELETE TOP (@iDeleteRows) FROM YOURTABLE WHERE [Created] <= @dtPurgeDate
    END
   ELSE
    BEGIN
     --display the delete that would have taken place
     PRINT 'DELETE TOP (' + CONVERT(VARCHAR(10), @iDeleteRows) + ') FROM YOURTABLE WHERE [Created] <= ''' + CONVERT(VARCHAR(25), @dtPurgeDate) + ''''
    END
   SET @iRowCount = @iRowCount + 1

  END TRY
  BEGIN CATCH
   --if there are any issues with the delete, raise error and back out
   RAISERROR('Error purging YOURTABLE Records', 16, 1)
   RETURN
  END CATCH
 END
GO
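For example, a non-debug run that purges anything older than 30 days in batches of 5,000 rows might be invoked like this (the parameter values are just illustrative):

EXEC dbo.PurgeRecords @iPurgeDays = 30, @iDeleteRows = 5000, @bDebug = 0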

Upvotes: 0

BradC

Reputation: 39986

To keep the transaction log from growing out of control, delete in batches like this (SET ROWCOUNT limits the size of each batch):

DECLARE @i INT
SET @i = 1
SET ROWCOUNT 10000

WHILE @i > 0
BEGIN
    BEGIN TRAN
        DELETE FROM dbo.SuperBigTable
        WHERE RowDate < '2009-01-01'
    COMMIT
    SELECT @i = @@ROWCOUNT
END
SET ROWCOUNT 0

And here is a version using the preferred TOP syntax for SQL 2005 and 2008:

DECLARE @i INT
SET @i = 1

WHILE @i > 0
BEGIN
    BEGIN TRAN
        DELETE TOP (1000) FROM dbo.SuperBigTable
        WHERE RowDate < '2009-01-01'
    COMMIT
    SELECT @i = @@ROWCOUNT
END

Upvotes: 0

Ron Skufca

Reputation: 2078

I came up with the following T-SQL script, which parks an arbitrary window of recent data in a temp table, truncates the original table, and then copies the data back.

IF EXISTS(SELECT name FROM sys.tables WHERE name = 'tmp_xxx_tblGPSVehicleInfoLog')
BEGIN
    PRINT 'Dropping temp table tmp_xxx_tblGPSVehicleInfoLog'
    DROP TABLE tmp_xxx_tblGPSVehicleInfoLog
END
GO

PRINT 'Creating temp table tmp_xxx_tblGPSVehicleInfoLog'
CREATE TABLE [dbo].[tmp_xxx_tblGPSVehicleInfoLog](
    [GPSVehicleInfoLogId] [uniqueidentifier] NOT NULL,
    [GPSVehicleInfoId] [uniqueidentifier] NULL,
    [Longitude] [float] NULL,
    [Latitude] [float] NULL,
    [GroundSpeed] [float] NULL,
    [Altitude] [float] NULL,
    [Heading] [float] NULL,
    [GPSDeviceTimeStamp] [datetime] NULL,
    [Milliseconds] [float] NULL,
    [DistanceNext] [float] NULL,
    [UpdateDate] [datetime] NULL,
    [Stopped] [nvarchar](1) NULL,
    [StopTime] [datetime] NULL,
    [StartTime] [datetime] NULL,
    [TimeStopped] [nvarchar](100) NULL
) ON [PRIMARY]
GO

PRINT 'Inserting data from tblGPSVehicleInfoLog to tmp_xxx_tblGPSVehicleInfoLog'
INSERT INTO tmp_xxx_tblGPSVehicleInfoLog
SELECT * 
FROM tblGPSVehicleInfoLog 
WHERE tblGPSVehicleInfoLog.UpdateDate BETWEEN '03/30/2009 23:59:59' AND '05/19/2009 00:00:00'
GO

PRINT 'Truncating table tblGPSVehicleInfoLog'
TRUNCATE TABLE tblGPSVehicleInfoLog
GO

PRINT 'Inserting data from tmp_xxx_tblGPSVehicleInfoLog to tblGPSVehicleInfoLog'
INSERT INTO tblGPSVehicleInfoLog 
SELECT * FROM tmp_xxx_tblGPSVehicleInfoLog 
GO

Upvotes: 0

Bijimon

Reputation: 322

Dropping a table is fast, even a very large one. So here is what I would do:

  • Script out your table, complete with indexes, from Management Studio.

  • Edit the script and run it to create a copy of your table. Call it table2.

  • Do a SELECT ... INSERT to park the data you want to retain in table2.

  • Rename the old table, say to tableOld, then rename table2 to the original name.

  • Wait. If no one screams at you, drop tableOld.

There is some risk. 1) Check whether there are triggers or constraints defined on the original table; they may not get included in the script generated by Management Studio. 2) If the original table has identity fields, you may have to turn on IDENTITY_INSERT before inserting into the new table.
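A rough sketch of the data move and the swap in T-SQL, using the OP's table and UpdateDate column plus the table2/tableOld names above (purely illustrative; indexes, triggers, and constraints still have to be scripted separately):

-- Park only the rows worth keeping in the copy
INSERT INTO dbo.table2
SELECT *
FROM dbo.tblGPSVehicleInfoLog
WHERE UpdateDate >= DATEADD(MONTH, -1, GETDATE());

-- Swap the names
EXEC sp_rename 'dbo.tblGPSVehicleInfoLog', 'tableOld';
EXEC sp_rename 'dbo.table2', 'tblGPSVehicleInfoLog';

-- Later, once nobody has screamed:
-- DROP TABLE dbo.tableOld;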

Upvotes: 0

Adam Porad

Reputation: 14471

Does the delete statement use any of the indexes on the table? Often a huge performance improvement can be obtained either by modifying the statement to use an existing index or by adding an index that supports the query the delete statement performs.
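For example, if the purge filters on the UpdateDate column from the OP's table, a supporting index might look like this (the index name is an assumption):

CREATE NONCLUSTERED INDEX IX_tblGPSVehicleInfoLog_UpdateDate
    ON dbo.tblGPSVehicleInfoLog (UpdateDate);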

Also, as others mentioned, the deletes should be done in multiple chunks instead of one huge statement. This prevents the table from being locked too long, and keeps other processes from timing out while waiting for the delete to finish.

Upvotes: 0

S.Lott

Reputation: 391952

Welcome to Data Warehousing. You need to split your data into two parts.

  • The actual application, with current data only.

  • The history.

You need to write a little "ETL" job to move data from current to history and then delete from current the rows that were moved.

You need to run this periodically. Daily, weekly, monthly, quarterly -- technically it doesn't matter. What matters is what use the history has and who uses it.
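A minimal sketch of such a job in T-SQL, assuming a history table with the same structure and the OP's UpdateDate column (the table names are illustrative, and the DELETE can be batched as other answers describe):

DECLARE @cutoff DATETIME
SET @cutoff = DATEADD(MONTH, -1, GETDATE())

BEGIN TRAN

    -- Move anything older than a month into the history table
    INSERT INTO dbo.tblGPSVehicleInfoLog_History
    SELECT *
    FROM dbo.tblGPSVehicleInfoLog
    WHERE UpdateDate < @cutoff

    -- Then remove those rows from the current table
    DELETE FROM dbo.tblGPSVehicleInfoLog
    WHERE UpdateDate < @cutoff

COMMIT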

Upvotes: 3

time4tea

Reputation: 2197

If you are using Oracle, I would set up partitioning by date on your tables and indexes. Then you delete the data by dropping the partition... the data magically goes away with the partition.

This is an easy step - and doesn't clog up your redo logs etc.

There's a basic intro to all this here
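A rough sketch in Oracle syntax, assuming monthly range partitions on the GPS timestamp (the table, column, and partition names are made up for illustration):

-- A cut-down table partitioned by month on the GPS timestamp
CREATE TABLE gps_log (
    gps_log_id   NUMBER        NOT NULL,
    latitude     BINARY_DOUBLE,
    longitude    BINARY_DOUBLE,
    update_date  DATE          NOT NULL
)
PARTITION BY RANGE (update_date) (
    PARTITION p_2009_03 VALUES LESS THAN (DATE '2009-04-01'),
    PARTITION p_2009_04 VALUES LESS THAN (DATE '2009-05-01'),
    PARTITION p_2009_05 VALUES LESS THAN (DATE '2009-06-01')
);

-- Purging a month of old data is then a metadata operation, not a row-by-row delete
ALTER TABLE gps_log DROP PARTITION p_2009_03 UPDATE GLOBAL INDEXES;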

Upvotes: 0

Jeff Ferland

Reputation: 18312

Keep in mind that most databases lock the neighboring records in an index during a transaction, so keeping your operations short will be helpful. I'm assuming that your insertions are failing on lock wait timeouts, so delete your data in small, bursty transactions. I'd suggest a single-threaded Perl script that loops through the oldest rows in 1,000-row chunks. I hope your primary key (and hopefully your clustered index, in case they somehow ended up being two different things) can be correlated to time, as that would be the best thing to delete by.

PseudoSQL:

SELECT max(primId) WHERE timestamp < 3_months_ago
DELETE FROM table WHERE primId < maxPrimId LIMIT 1000
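In SQL Server terms that might look roughly like this (table and column names are placeholders, and TOP replaces LIMIT):

DECLARE @maxPrimId INT

-- Find the highest key among rows older than three months
SELECT @maxPrimId = MAX(primId)
FROM dbo.YourTable
WHERE RowDate < DATEADD(MONTH, -3, GETDATE())

-- Run this repeatedly until it affects zero rows
DELETE TOP (1000)
FROM dbo.YourTable
WHERE primId <= @maxPrimId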

Now, here's the really fun part: All these deletions MAY make your indexes a mess and require that they be rebuilt to keep the machine from getting slow. In that case, you'll either have to swap in an up-to-date slave, or just suffer some downtime. Make sure you test for this possible case on your test machine.

Upvotes: 0

Dayton Brown

Reputation: 1240

I would do a manual delete by day/month (whatever is the largest unit you can get away with). Once you do that first one, write a stored proc that kicks off every day and deletes the oldest data you don't need.

DELETE FROM TABLENAME 
WHERE DATEDIFF(day, tableDateTime, GETDATE()) > 90
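A sketch of wrapping that in a procedure that a daily job (SQL Agent or similar) could call; the table and column names are just the placeholders from above:

CREATE PROCEDURE dbo.PurgeOldGpsData
AS
BEGIN
    SET NOCOUNT ON

    -- Remove anything older than 90 days; batch it (see other answers) if the volume is large
    DELETE FROM TABLENAME
    WHERE DATEDIFF(day, tableDateTime, GETDATE()) > 90
END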

Personally, I hate doing stuff to production datasets where one missed key results in some really bad things happening.

Upvotes: 2

Eric

Reputation: 8078

Try this

WHILE EXISTS (SELECT * FROM Table WHERE (condition for deleting))
BEGIN
    SET ROWCOUNT 1000
    DELETE FROM Table WHERE (condition for deleting)
    SET ROWCOUNT 0
END

This will delete the rows in groups of 1000

Upvotes: 5

BIBD

Reputation: 15404

I assume you can't take down the production system (or queue up the GPS results for insertion after the purge is complete).

I'd go with your inclination of deleting a fraction of it at a time (perhaps 10%) depending on the performance you find on your test system.

Is your table indexed? That might help, but the indexing process may have similar effects on the system as doing the one great purge.

Upvotes: 1

Aaron Alton

Reputation: 23236

If you're using SQL Server 2005 or 2008, sliding window partitioning is the perfect solution for this - instant archiving or purging without any perceptible locking. Have a look here for further information.
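Very roughly, the moving parts look like this; it is only a sketch, since a real sliding-window setup also needs a staging table with an identical structure on the same filegroup and aligned indexes (all names below are made up):

-- Monthly boundaries on the GPS timestamp
CREATE PARTITION FUNCTION pfGpsByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2009-03-01', '2009-04-01', '2009-05-01');

CREATE PARTITION SCHEME psGpsByMonth
AS PARTITION pfGpsByMonth ALL TO ([PRIMARY]);

-- (the GPS table would be created ON psGpsByMonth(UpdateDate))

-- Archiving the oldest month is then a metadata switch rather than a delete
ALTER TABLE dbo.tblGPSVehicleInfoLog
SWITCH PARTITION 1 TO dbo.tblGPSVehicleInfoLog_Staging;

TRUNCATE TABLE dbo.tblGPSVehicleInfoLog_Staging;

-- Remove the now-empty boundary so the window keeps sliding
ALTER PARTITION FUNCTION pfGpsByMonth() MERGE RANGE ('2009-03-01');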

Upvotes: 3

Tom H

Reputation: 47392

I would probably do it in batches, as you have already come up with. Another option would be to insert the important data into another table, truncate the GPS table, then reinsert the important data. You would have a small window where you would be missing the recent historical data; how small that window is would depend on how much data you needed to reinsert. Also, you would need to be careful if the table uses auto-incrementing numbers or other defaults, so that you keep the original values.
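If the table does use an identity column, the reinsert needs SET IDENTITY_INSERT and an explicit column list so the original values are preserved; a minimal sketch with made-up names:

SET IDENTITY_INSERT dbo.tblGps ON

INSERT INTO dbo.tblGps (GpsId, UpdateDate, Latitude, Longitude)
SELECT GpsId, UpdateDate, Latitude, Longitude
FROM dbo.tblGps_Keep

SET IDENTITY_INSERT dbo.tblGps OFF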

Once you have the table cleaned up, a regular cleaning job should be scheduled. You might also want to look into partitioning depending on your RDBMS.

Upvotes: 1

Jhonny D. Cano -Leftware-

Reputation: 18013

My 2 cents:

If you are using SQL 2005 or above, you can consider partitioning your table based on the date field, so the table doesn't get locked when deleting old records.

Also, if you are in a position to make DBA decisions, you can temporarily change your recovery model to Simple so the log doesn't grow too fast; it will still grow, but it won't be as detailed.
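Switching the recovery model is a one-liner each way; the database name is illustrative, and a full (or differential) backup should follow when switching back so the log chain restarts:

ALTER DATABASE GpsTracking SET RECOVERY SIMPLE

-- ... run the purge in batches ...

ALTER DATABASE GpsTracking SET RECOVERY FULL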

Upvotes: 10

Matt

Reputation: 2055

A better approach is to create a temporary table and insert only the data you want to keep, then truncate your original table and copy the data back.

Oracle syntax (SQL Server is similar)

create table keep as select * from source where data_is_good = 1;
truncate table source;
insert into source select * from keep;

You'll need to disable foreign keys, if there are any on the source table.

In Oracle, index names must be unique across the entire schema, not just per table. In SQL Server, you can further optimize this by simply renaming "keep" to "source", since you can easily create indexes of the same name on both tables.

Upvotes: 4

Michael Todd

Reputation: 17071

Can you copy recent data to a new table, truncate the table, then copy it back?

Of course, then you're going to need to worry about doing that again in 6 months or a year.

Upvotes: 2
