Reputation: 407
I want to delete 10 GB (1%) of data from a 1 TB table. I have come across several articles on deleting large amounts of data from a huge table, but didn't find much on deleting a small percentage of data from a huge table.
Additional details: I'm trying to delete bot data from the visits table. The filter condition is a combination of fields: ip in (a list of about 20 IPs) and useragent like '%SOMETHING%'.
The useragent column is varchar(1024).
The data can be old or new, so I can't use a date filter.
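For illustration, the statement I have in mind looks roughly like this (the IP values are placeholders, and 'SOMETHING' stands in for the real bot marker):
DELETE FROM visits
WHERE ip IN ('203.0.113.10', '203.0.113.11' /* ...about 20 IPs in total */)
  AND useragent LIKE '%SOMETHING%';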
Upvotes: 1
Views: 1486
Reputation: 1
I had a similar problem. In my case, some other big tables referenced the primary key of the table I was deleting from as a foreign key, and those foreign key columns had no indexes. After I added indexes on them, the deletes ran much faster.
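For example, a minimal sketch, assuming a hypothetical child table visit_details whose visit_id column references the primary key of the table being purged:
CREATE INDEX IX_visit_details_visit_id
    ON visit_details (visit_id);
Without an index like this, SQL Server has to scan the child table for every parent row it deletes just to verify the foreign key.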
Upvotes: 0
Reputation: 7837
The filter condition is ... ip in (list of ips about 20 of them) and useragent like '%SOMETHING%'
Regarding table size, it's important to touch as few rows as possible while executing the delete.
I imagine on a table that size you already have an index on the ip column. It might help (or not) to put your list of 20 or so IPs in a table instead of in an in clause, especially if they're parameters. I'd look at the query plan to see.
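A rough sketch of that idea using a temp table (the names and the ip column type here are only illustrative):
CREATE TABLE #bot_ips (ip varchar(45) PRIMARY KEY);
INSERT INTO #bot_ips (ip)
VALUES ('203.0.113.10'), ('203.0.113.11'); -- ...and the rest of the ~20

DELETE v
FROM visits AS v
JOIN #bot_ips AS b ON b.ip = v.ip
WHERE v.useragent LIKE '%SOMETHING%';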
I hope useragent like '%SOMETHING%' is usually true; otherwise it's an expensive test, because SQL Server has to examine every row for an eligible ip. If not, a redesign to allow the query to avoid like would probably be beneficial.
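One possible shape for such a redesign, purely as an illustration (is_bot is an invented column, and adding it to a 1 TB table is itself a sizeable operation): precompute the like test into a persisted, indexable flag so the delete can seek instead of scanning useragent.
ALTER TABLE visits
    ADD is_bot AS (CASE WHEN useragent LIKE '%SOMETHING%' THEN 1 ELSE 0 END) PERSISTED;

CREATE INDEX IX_visits_is_bot_ip ON visits (is_bot, ip);
The delete could then filter on is_bot = 1 and ip without evaluating like at run time.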
[D]eleting a smaller percentage isn't really relevant. Using selective search criteria is (per above), as is the size of the delete transaction in absolute terms. By definition, the number of rows deleted and their row size determine the size of the transaction. Very large transactions can push against machine resources. Breaking them up into smaller ones can yield better performance in such cases.
The last server I used had 0.25 TB of RAM and was comfortable deleting 1 million rows at a time, but not 10 million. Your mileage will vary; you have to try, and observe, to see.
How much you're willing to tax the machine will depend on what else is running (or needs to be able to run) at the same time. The way you break up one logical action -- delete all rows where [condition] -- into "chunks" also depends on what you want the database to look like while the delete procedure is in progress, when some chunks are deleted and others remain present.
If you do decide to break it into chunks, I recommend not using a fixed number of rows and TOP(n) syntax, because that's the least logical solution. Unless you use order by, you're leaving it to the server to choose arbitrarily which N rows to delete. If you do use order by, you're requiring the server to sort the result before starting the delete, possibly several times over the whole run. Bleh!
Instead, find some logical subset of rows, ideally distinguishable along the clustered index, that falls beneath your machine's threshold for an acceptable number of rows to delete at one time. Loop over that set. In your case, I would be tempted to iterate over the set of ip values in the in clause. Instead of delete ... where ip in (...), you get (roughly) for each ip: delete ... where ip = @ip.
The advantage of the latter approach is that you always know where the database stands. If you kill the procedure or it gets rolled back partway through its iteration, you can examine the database to see which IPs still remain (or whatever criteria you end up using). You avoid any kind of pathological behavior whereby some query gets a partial result because part of your selection criteria (as chosen by the server alone) is still present and the rest has been deleted. In thinking about the problem, you can say "I'm unable to delete ip 192.168.0.1 because ..." without wondering which portion of its rows has already been removed.
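A minimal sketch of that per-ip loop (the @BotIps contents and the varchar(45) type are assumptions standing in for your real list):
DECLARE @BotIps TABLE (ip varchar(45) PRIMARY KEY);
INSERT INTO @BotIps (ip)
VALUES ('203.0.113.10'), ('203.0.113.11'); -- ...and the rest of the ~20

DECLARE @ip varchar(45);
WHILE EXISTS (SELECT 1 FROM @BotIps)
BEGIN
    SELECT TOP (1) @ip = ip FROM @BotIps ORDER BY ip;

    -- One ip per pass: a small transaction, and you always know how far you've gotten.
    DELETE FROM visits
    WHERE ip = @ip
      AND useragent LIKE '%SOMETHING%';

    DELETE FROM @BotIps WHERE ip = @ip;
END
If any single ip still yields more rows than the machine is comfortable deleting at once, that one ip can be split further, for example by ranges of the clustered key.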
In sum, I recommend: keep each delete transaction small enough for your machine to handle comfortably, avoid TOP(n)-style chunking, and iterate over the ip values instead, deleting one ip per pass, so you always know exactly what has and hasn't been removed.
Upvotes: 1
Reputation: 171
Here is a batch delete in chunks that I use regularly. Perhaps it would give you some ideas on how to approach your need. I create a stored proc and call the proc from a SQL Agent Job. I generally schedule it to allow a transaction log backup between executions so the log does not grow too large. You could always just run it interactively if you wish.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROC [DBA_Delete_YourTableName] AS

SET NOCOUNT ON;

---------------------------------------------------------
-- Retention window: purge rows older than this many days (never fewer than 30).
DECLARE @DaysHistoryToKeep INT
SET @DaysHistoryToKeep = 90
IF @DaysHistoryToKeep < 30
    SET @DaysHistoryToKeep = 30
---------------------------------------------------------

DECLARE @continue INT
DECLARE @rowcount INT
DECLARE @loopCount INT
DECLARE @MaxLoops INT
DECLARE @TotalRows BIGINT
DECLARE @PurgeThruDate DATETIME

SET @PurgeThruDate = DATEADD(dd,(-1)*(@DaysHistoryToKeep+1), GETDATE())

-- Cap the number of batches per execution so the log can be backed up between runs.
SET @MaxLoops = 100
SET @continue = 1
SET @loopCount = 0

SELECT @TotalRows = (SELECT COUNT(*) FROM YourTableName (NOLOCK) WHERE CREATEDDATETIME < @PurgeThruDate)
PRINT 'Total Rows = ' + CAST(@TotalRows AS VARCHAR(20))
PRINT ''

WHILE @continue = 1
BEGIN
    SET @loopCount = @loopCount + 1
    PRINT 'Loop # ' + CAST(@loopCount AS VARCHAR(10))
    PRINT CONVERT(VARCHAR(20), GETDATE(), 120)

    -- Each batch is its own short transaction of at most 4500 rows.
    BEGIN TRANSACTION
        DELETE TOP (4500) YourTableName WHERE CREATEDDATETIME < @PurgeThruDate
        SET @rowcount = @@rowcount
    COMMIT

    PRINT 'Rows Deleted: ' + CAST(@rowcount AS VARCHAR(10))
    PRINT CONVERT(VARCHAR(20), GETDATE(), 120)
    PRINT ''

    -- Stop when nothing is left to delete or the per-run loop cap is reached.
    IF @rowcount = 0 OR @loopCount >= @MaxLoops
    BEGIN
        SET @continue = 0
    END
END

SELECT @TotalRows = (SELECT COUNT(*) FROM YourTableName (NOLOCK) WHERE CREATEDDATETIME < @PurgeThruDate)
PRINT 'Total Rows Remaining = ' + CAST(@TotalRows AS VARCHAR(20))
PRINT ''
GO
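To run it interactively instead of from the Agent job, the call is simply:
EXEC [DBA_Delete_YourTableName];
Each execution deletes at most @MaxLoops * 4500 rows, so you can take a transaction log backup between runs and repeat until 'Total Rows Remaining' reaches zero.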
Upvotes: 1