Reputation: 961
So I'm working on some database 'de-identification' where essentially each piece of information changes. On most of the smaller tables (up through about 10,000 rows) a simple WHILE-loop update wasn't too time consuming. I've now moved to one that has approximately 500,000 rows.
I've read that the fastest way to do this sort of "update" is in fact to SELECT INTO a temporary table, changing the column you need along the way (I read it here: Fastest way to update 120 Million records).
The problem with this is that the OP there is setting many rows to a single value, whereas each of my values is different: he updates the NULL rows in a single column to -1, while I'm updating each row to a more or less random date. This is what I've got so far.
--The only index on Treatments is a clustered primary key (TreatmentID)
SELECT * INTO #Treatments_temp
FROM Treatments

CREATE CLUSTERED INDEX IDX_Treatments ON #Treatments_temp(TreatmentID)

DECLARE @rows INT;
DECLARE @year INT;
DECLARE @month INT;
DECLARE @date INT;
DECLARE @newStartDate SMALLDATETIME;
DECLARE @multiplier FLOAT;
DECLARE @timebetween INT; --minutes between StartDate and EndDate, set elsewhere

SET @rows = (SELECT TOP 1 TreatmentID
             FROM Treatments
             ORDER BY TreatmentID DESC)

WHILE (@rows > 0)
BEGIN
    --There are only 500,000 records in this table per COUNT(*), but the PK goes much
    --higher (some records were deleted, made in error, etc.), so this IF is my
    --attempt to skip values of @rows that don't actually exist.
    IF (SELECT TreatmentID FROM #Treatments_temp WHERE TreatmentID = @rows) IS NOT NULL
    BEGIN
        SET @multiplier = RAND();
        SET @year = @multiplier * 99 + 1900;
        SET @month = @multiplier * 11 + 1;
        SET @date = @multiplier * 27 + 1;
        SET @newStartDate = DATEADD(MONTH, ((@year - 1900) * 12) + @month - 1, @date - 1);
        UPDATE #Treatments_temp
        SET StartDate = @newStartDate,
            EndDate = DATEADD(MINUTE, @timebetween, @newStartDate)
        WHERE TreatmentID = @rows
    END
    SET @rows = @rows - 1
END
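For reference, the DATEADD expression above is the classic trick for building a date from its parts before DATEFROMPARTS existed: an integer N passed as a date is treated as N days after 1900-01-01, so starting from day @date - 1 and adding ((@year - 1900) * 12) + @month - 1 months lands on the intended year, month, and day. A quick check with sample values (these particular values are just an illustration, not from the question):

```sql
-- Day number 0 = 1900-01-01, so @date - 1 is the 1-based day of the month.
-- Adding ((@year - 1900) * 12) + @month - 1 months shifts it into the right year/month.
DECLARE @year INT = 1963, @month INT = 7, @date INT = 15;
SELECT DATEADD(MONTH, ((@year - 1900) * 12) + @month - 1, @date - 1);  -- 1963-07-15
```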
Upvotes: 2
Views: 2788
Reputation: 6002
I think this should work:
-- using NewID() instead of Rand() because Rand() is only evaluated once for the entire query, while NewID() is evaluated for each record
-- Based on your logic I understand newStartDate had to be between 1 jan 1801 and 28 dec 1999
DECLARE @multiplier float
DECLARE @max_int float
DECLARE @daterange float
SELECT @max_int = Power(Convert(float, 2), 31), -- signed int !
@daterange = DateDiff(day, '1 jan 1801', '28 dec 1999')
UPDATE Treatments
SET @multiplier = (@max_int - Convert(real, ABS(BINARY_CHECKSUM(NewID())))) / @max_int, -- returns something between 0 and 1
StartDate = DateAdd(day, Convert(int, (@daterange * @multiplier)), '1 jan 1801') -- returns somewhere in the daterange
-- test 'spread'
SELECT COUNT(*), COUNT(DISTINCT StartDate), Min(StartDate), Max(StartDate) FROM Treatments
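Since the question also rewrites EndDate, the same statement can carry a second assignment. A sketch, assuming @timebetween (the minutes offset from the question) has been declared and set; it leans on the same in-statement variable assignment the UPDATE above already relies on:

```sql
-- @timebetween comes from the question and is assumed to be declared and set.
-- @multiplier is assigned per row and then reused for both date columns.
UPDATE Treatments
SET @multiplier = (@max_int - Convert(real, ABS(BINARY_CHECKSUM(NewID())))) / @max_int,
    StartDate = DateAdd(day, Convert(int, (@daterange * @multiplier)), '1 jan 1801'),
    EndDate   = DateAdd(minute, @timebetween,
                        DateAdd(day, Convert(int, (@daterange * @multiplier)), '1 jan 1801'))
```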
If anyone wants to test this, you can use the following to generate some test data (@Kulingar: make sure not to drop your table by accident =)
IF DB_ID('test') IS NULL CREATE DATABASE test
GO
USE test
GO
IF Object_ID('test..Treatments') IS NOT NULL DROP TABLE test..Treatments
GO
SELECT row_id = IDENTITY(int, 1, 1), StartDate = CURRENT_TIMESTAMP INTO Treatments FROM sys.columns, sys.objects
Upvotes: 1
Reputation: 23300
I'd accomplish this by writing a small program which would:
- read the rows that need de-identifying,
- generate the replacement random dates in application code, and
- write the changes back in batches.
This way you perform your logic outside the database and limit the CRUD to the bare essentials.
Upvotes: 0
Reputation: 65157
Without knowing more about what you have, I think it would be simplest to:
- build a narrow table containing the ID and the result of the function for each ID, then
- update the Treatment table with an INNER JOIN on the narrow table to get the new value.
There's no need for a row-by-row approach to this.
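A minimal sketch of that approach, assuming the Treatments/TreatmentID/StartDate names from the question; the random-date expression here is just a placeholder (roughly 100 years of days from 1900), not the asker's exact logic:

```sql
-- Build a narrow temp table: one row per ID with the pre-computed new value.
SELECT TreatmentID,
       NewStartDate = DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 36500, '1 jan 1900')
INTO #NewDates
FROM Treatments;

-- One set-based UPDATE via INNER JOIN; no row-by-row loop needed.
UPDATE t
SET t.StartDate = n.NewStartDate
FROM Treatments AS t
INNER JOIN #NewDates AS n
    ON n.TreatmentID = t.TreatmentID;
```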
Upvotes: 2