Kulingar
Kulingar

Reputation: 961

Updating 500,000 rows in SQL Server with unique data

So I'm working on some database 'de-identification' where essentially each piece of information changes. On most of the smaller tables a simple while update wasn't too time consuming (up through tables with 10,000 rows or so. I've now moved to one that has approximately 500,000 rows.

I've read that the fastest way to do this sort of "update" is to in fact just select into a temporary table updating the column you need. (I read it here. Fastest way to update 120 Million records)

The problem with this is that the OP is updating all similar values with a single value where as each of my values are different, ie he updates the null rows in a single column to -1, I'm updating each column in my new row to more or less a random date; This is what I've got so far.

--The only Index on Treatments is a Clustered Primary Key (TreatmentID)
SELECT * INTO #Treatments_temp
FROM Treatments
CREATE CLUSTERED INDEX IDX_Treatments ON #Treatments_temp(TreatmentID)

SET @rows = (SELECT TOP 1 TreatmentID
             FROM Treatments
             ORDER BY TreatmentID Desc)

WHILE (@rows > 0)
  BEGIN

    --There are only 500,000 records in this table from count(*) but the PK is much 
    --higher (some records are deleted, made in error ETC so this if statement is my
    --attempt to bypass the code for @rows that don't actually exist.

    IF (SELECT TreatmentID FROM #Treatments_temp WHERE TreatmentID = @rows) IS NOT NULL
      BEGIN
      DECLARE @year INT;
      DECLARE @month INT;
      DECLARE @date INT;
      DECLARE @newStartDate SMALLDATETIME;
      DECLARE @multiplier FLOAT;

      SET @multiplier = (SELECT RAND());

      SET @year = @multiplier * 99 + 1900;
      SET @month = @multiplier * 11 + 1;
      SET @date = @multiplier * 27 + 1;

      SET @newStartDate = DATEADD(MONTH,((@year-1900)*12)+@month-1,@date-1);

      UPDATE #Treatments_temp
      SET StartDate = @newStartDate
      WHERE TreatmentID = @rows

      UPDATE #Treatments_temp
      SET EndDate = DATEADD(MINUTE, @timebetween, @newStartDate)
      WHERE TreatmentID = @rows
      END

  SET @rows = @rows - 1
  END

Upvotes: 2

Views: 2788

Answers (3)

deroby
deroby

Reputation: 6002

I think this should work :

-- using NewID() instead of Rand() because Rand() is only interpreted once for the entre query while NewID() is for each record
-- Based on your logic I understand newStartDate had to be between 1 jan 1801 and 28 dec 1999 
DECLARE @multiplier float
DECLARE @max_int    float
DECLARE @daterange  float

SELECT @max_int   = Power(Convert(float, 2), 31), -- signed int !
       @daterange = DateDiff(day, '1 jan 1801', '28 dec 1999')

UPDATE Treatments
   SET @multiplier  = (@max_int - Convert(real, ABS(BINARY_CHECKSUM(NewID())))) / @max_int, -- returns something between 0 and 1
       StartDate    = DateAdd(day, Convert(int, (@daterange * @multiplier)), '1 jan 1801') -- returns somewhere in the daterange

-- test 'spread'
SELECT COUNT(*), COUNT(DISTINCT StartDate), Min(StartDate), Max(StartDate) FROM Treatments

If anyone wants to test this you can use this to generate some test-data (@Kulingar: make sure not to drop your table by accident =)

IF DB_ID('test') IS NULL CREATE DATABASE test
GO
USE test
GO
IF Object_ID('test..Treatments') IS NOT NULL DROP TABLE test..Treatments
GO
SELECT row_id = IDENTITY(int, 1, 1), StartDate = CURRENT_TIMESTAMP INTO Treatments FROM sys.columns, sys.objects

Upvotes: 1

Alex
Alex

Reputation: 23300

i'd accomplish this writing a small program which would:

  1. select * and put in a structure
  2. change data in the structure
  3. replace old data with new data (drop table+create table or truncate+insert into might apply)

this way you perform your logic outside database and limit crud to the bare essential.

Upvotes: 0

JNK
JNK

Reputation: 65157

Without knowing more about what you have, I think it would be simplest to:

  • Put your "randomizing" logic into a scalar function
  • Make a narrow table with just your ID and the result of the function for each ID
  • Update your Treatment table with an INNER JOIN on the narrow table to get the new value

There's no need for a row-by-row approach to this.

Upvotes: 2

Related Questions