Milhad
Milhad

Reputation: 91

Efficiently randomize (shuffle) data in Sql Server table

I have a table with data which I have to randomize. By randomizing, I mean use data from random row to update another row in that same column. Problem is that the table itself is big (more than 2 000 000 rows).

I wrote a piece of code which uses while loop, but it's going slow.

Does anyone have any suggestion about more efficient way of achieving randomization?

Upvotes: 5

Views: 27057

Answers (5)

A.Bahrami
A.Bahrami

Reputation: 162

You just order by random in the select statement

Fast and Simple

In SQL Server

SELECT * FROM TableName
order by NEWID()

In Oracle

SELECT * FROM TableName
ORDER BY DBMS_RANDOM.VALUE

In MySQL

SELECT * FROM TableName
ORDER BY RAND()

PostgreSQL

SELECT * FROM TableName
ORDER BY random()

Conclusion

Randomizing the SQL query result set is a common requirement, so it’s useful to know what SQL function you can use in the ORDER BY clause depending on the underlying relational database type.

Upvotes: 2

virtualadrian
virtualadrian

Reputation: 4738

ORIGINAL ANSWER FOR BREVITY:

Based on Mitch Wheats answer linking to this article on scrambline data you can do something like this to scramble a bunch of fields, you're not just limited to the IDs:

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY [UserID]) AS orig_rownum, 
      ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
      * 
FROM [UserTable]
) 
UPDATE T1 
   SET [UserID] = T2.[UserID]
      ,[FirstName] = T2.[FirstName]
      ,[LastName] = T2.[LastName]
      ,[AddressLine1] =  T2.[AddressLine1]
      ,[AddressLine2] =  T2.[AddressLine2]
      ,[AddressLine3] =  T2.[AddressLine3]
      ,[City] = T2.[City]
      ,[State] = T2.[State]
      ,[Pincode] = T2.[Pincode]
      ,[PhoneNumber] = T2.[PhoneNumber]
      ,[MobileNumber] = T2.[MobileNumber]
      ,[Email] = T2.[Email]
      ,[Status] = T2.[Status] 
FROM Randomize T1 
      join Randomize T2 on T1.orig_rownum = T2.new_rownum 
;

So you aren't just limited to doing this as the article shows:

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS orig_rownum, 
      ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
      * 
FROM [MyTable]
) 
UPDATE T1 SET Id = T2.Id 
FROM Randomize T1 
      join Randomize T2 on T1.orig_rownum = T2.new_rownum 
;

The danger to this approach is the amount of data you are tweaking. Using a CTE crams all that stuff into memory, so while I found that this is fairly speedy ( 19 seconds for a 500k row table ). You will want to be careful if you have a table that has millions of records. You should consider how much data is actually needed or is a good population sample, for testing and development.


EDIT/UPDATED ANSWER:

As I mention in my comment below, I recommend looking into a tool that does this and buying a license. For example: https://www.red-gate.com/products/oracle-development/data-masker-for-oracle/

Sometimes we're budget-constrained and in that situation, this is how I "randomize" data. Below is an example, followed by an explanation/caveat, run on MSSQL 2017:

-- DROP/CREATE FAKEY TEMP TABLE --
IF OBJECT_ID('tempdb..#Employees') IS NOT NULL
    DROP TABLE #Employees

CREATE TABLE #Employees (
     Id INT IDENTITY PRIMARY KEY
    ,EmployeeNumber VARCHAR(20)
    ,UserName VARCHAR(384)
    ,FirstName VARCHAR(128)
    ,MiddleName VARCHAR(128)
    ,LastName VARCHAR(256)
    ,HireDate DATETIME2
    ,BirthDate DATETIME2
    ,Email VARCHAR(384)
)

/*
==== POPULATE FAKEY TEMP TABLE ====
Generated here: https://www.mockaroo.com/
VimFu turns CSV to INSERT statements.
*/
INSERT INTO #Employees VALUES('29-7477088','cstudman0','Moyna','Cobb','Studman','2004-05-03 22:45:55','1991-09-02 18:14:58','[email protected]')
INSERT INTO #Employees VALUES('39-0211165','cgremane1','Cristabel','Carlye','Gremane','2016-06-24 06:31:06','2006-01-09 21:17:28','[email protected]')
INSERT INTO #Employees VALUES('50-1527461','jandree2','Kaja','Justinian','Andree','2020-04-03 06:48:12','2015-10-21 00:48:05','[email protected]')
INSERT INTO #Employees VALUES('53-4435748','pportman3','Zachery','Prudi','Portman','2018-07-23 20:27:23','2017-10-17 18:38:06','[email protected]')
INSERT INTO #Employees VALUES('51-6508890','lnutbeam4','Bobbee','Lilah','Nutbeam','2017-02-03 00:37:52','2002-08-25 11:16:41','[email protected]')
INSERT INTO #Employees VALUES('85-8633318','gleach5','Jaquith','Glenda','Leach','2019-03-12 19:32:33','2008-07-17 08:42:00','[email protected]')
INSERT INTO #Employees VALUES('54-1590858','cwandrach6','Anna-diane','Conrad','Wandrach','1997-12-10 06:09:28','1994-07-22 08:15:27','[email protected]')
INSERT INTO #Employees VALUES('79-7072949','mrankling7','Zeke','Molli','Rankling','2018-12-20 02:35:47','2020-03-10 22:28:33','[email protected]')
INSERT INTO #Employees VALUES('92-8250045','emebius8','Tabor','Eachelle','Mebius','1994-04-15 09:16:44','1995-02-20 17:17:29','[email protected]')
INSERT INTO #Employees VALUES('87-4813998','olyes9','Serene','Ophelie','Lyes','2002-10-08 23:38:19','1996-08-08 03:08:25','[email protected]')
INSERT INTO #Employees VALUES('31-3555215','ykornasa','Faydra','Yancy','Kornas','2004-02-23 12:32:51','1995-10-24 12:27:54','[email protected]')
INSERT INTO #Employees VALUES('16-6787860','pgoldsbyb','Cristian','Phyllis','Goldsby','2012-02-26 11:55:51','1994-03-04 07:18:46','[email protected]')
INSERT INTO #Employees VALUES('11-2332324','bnavarijoc','Werner','Bartlett','Navarijo','2013-02-18 09:02:08','2013-07-19 19:34:39','[email protected]')
INSERT INTO #Employees VALUES('69-9128351','rprobartd','Wolfgang','Ricardo','Probart','1999-01-16 03:33:25','2014-08-02 06:07:44','[email protected]')
INSERT INTO #Employees VALUES('83-0754916','mmaclachlane','Johannah','Mozelle','MacLachlan','2018-09-28 01:02:17','2004-11-11 06:25:56','[email protected]')
INSERT INTO #Employees VALUES('86-9635344','hortegaf','Giulietta','Hetti','Ortega','2003-12-01 07:03:20','2006-11-27 11:53:03','[email protected]')
INSERT INTO #Employees VALUES('87-5426191','akeatingg','Gilbertine','Anabelle','Keating','1991-09-04 07:24:51','2018-09-27 15:30:29','[email protected]')
INSERT INTO #Employees VALUES('94-4568960','aalsinah','Rhona','Alysia','Alsina','1994-12-19 09:09:01','2010-02-12 14:23:18','[email protected]')
INSERT INTO #Employees VALUES('32-6080729','tgarradi','Bank','Trumann','Garrad','2015-06-07 05:42:30','2002-11-06 08:49:17','[email protected]')
INSERT INTO #Employees VALUES('88-0899323','fghentj','Daphene','Feodor','Ghent','2003-04-19 14:19:10','2006-08-07 02:19:51','[email protected]')
INSERT INTO #Employees VALUES('00-7608833','arosebladek','Ed','Andrea','Roseblade','2019-09-05 00:25:28','2006-10-24 07:12:49','[email protected]')
INSERT INTO #Employees VALUES('49-8817469','achallinl','Mandel','Andonis','Challin','2002-12-25 16:34:16','2003-08-15 15:52:04','[email protected]')
INSERT INTO #Employees VALUES('89-3690501','pgurkom','Jock','Peterus','Gurko','1995-09-18 02:50:40','1992-11-14 05:57:15','[email protected]')
INSERT INTO #Employees VALUES('70-7699938','rklimpn','Florida','Ricky','Klimp','1991-11-10 01:36:31','1997-10-18 06:09:28','[email protected]')
INSERT INTO #Employees VALUES('57-7312244','vrentilllo','Tremaine','Verla','Rentilll','2005-07-06 16:21:15','1993-02-04 00:57:48','[email protected]')
INSERT INTO #Employees VALUES('49-5173804','fcarenp','Carly','Fabiano','Caren','2019-10-19 22:59:51','2006-04-28 05:18:18','[email protected]')
INSERT INTO #Employees VALUES('82-6348827','lwhitewoodq','Gavin','Loren','Whitewood','2019-12-02 22:20:23','1998-07-25 13:15:13','[email protected]')
INSERT INTO #Employees VALUES('15-1749593','veronier','Candace','Vincenty','Eronie','2001-08-21 12:18:37','2000-11-09 00:43:32','[email protected]')
INSERT INTO #Employees VALUES('28-1121210','dharuards','Clim','Dionysus','Haruard','1996-02-19 20:43:40','2013-10-26 16:11:15','[email protected]')
INSERT INTO #Employees VALUES('89-0288115','dstigglest','Katalin','Danit','Stiggles','2007-03-25 01:04:44','2000-09-24 00:22:34','[email protected]')
INSERT INTO #Employees VALUES('74-3532636','tmelroseu','Gregorio','Thoma','Melrose','2010-10-26 12:55:55','2012-10-03 09:51:15','[email protected]')
INSERT INTO #Employees VALUES('22-8027830','hbeazev','Peirce','Hettie','Beaze','2001-07-30 16:36:55','1992-07-17 13:21:09','[email protected]')
INSERT INTO #Employees VALUES('62-2571142','etoonw','Carlita','Emmett','Toon','2015-07-13 02:44:51','2001-09-18 07:06:14','[email protected]')
INSERT INTO #Employees VALUES('99-2911468','ayarntonx','Caye','Antonin','Yarnton','2019-02-14 23:26:54','2020-04-08 12:42:46','[email protected]')
INSERT INTO #Employees VALUES('33-8734931','tmationy','Coreen','Tomi','Mation','2011-06-18 09:49:40','2014-05-06 01:34:35','[email protected]')
INSERT INTO #Employees VALUES('89-1230660','gordeltz','Sammie','Granny','Ordelt','1996-05-10 03:58:30','1991-05-18 11:19:27','[email protected]')
INSERT INTO #Employees VALUES('80-9988795','ilaverock10','Selina','Isabelle','Laverock','2015-06-29 12:38:49','2002-07-27 00:43:15','[email protected]')
INSERT INTO #Employees VALUES('61-0387718','sbraithwaite11','Riordan','Sonnnie','Braithwaite','2000-06-08 03:36:23','1998-12-29 14:13:12','[email protected]')
INSERT INTO #Employees VALUES('75-4850210','bolagen12','Lurette','Bradley','O''Lagen','1990-01-05 10:55:13','1995-04-26 17:47:18','[email protected]')
INSERT INTO #Employees VALUES('40-2385038','riron13','Mignon','Ronny','Iron','2004-06-05 19:46:00','2006-05-02 15:17:31','[email protected]')
INSERT INTO #Employees VALUES('97-5320734','kkibbe14','Manon','Kyrstin','Kibbe','1993-02-25 16:49:38','2006-09-03 04:20:53','[email protected]')
INSERT INTO #Employees VALUES('54-6524877','sstorms15','Shaughn','Sandra','Storms','2013-09-30 17:52:42','1996-10-28 03:54:59','[email protected]')
INSERT INTO #Employees VALUES('11-8287102','tkamen16','Allyn','Tim','Kamen','1991-06-16 08:46:49','2020-04-11 01:15:58','[email protected]')
INSERT INTO #Employees VALUES('38-6081847','ralpe17','Cathrin','Rubetta','Alpe','2017-03-26 06:50:37','1993-05-30 05:11:09','[email protected]')
INSERT INTO #Employees VALUES('79-8160581','iparsonson18','Gael','Isabelita','Parsonson','2015-06-28 22:51:35','2001-08-26 14:07:54','[email protected]')
INSERT INTO #Employees VALUES('56-6871425','movendon19','Nikki','Margalit','Ovendon','2019-05-13 15:34:38','2013-10-02 09:38:17','[email protected]')
INSERT INTO #Employees VALUES('86-6665602','nfulep1a','Odessa','Nessa','Fulep','2010-04-21 14:38:58','2002-07-24 13:45:01','[email protected]')
INSERT INTO #Employees VALUES('41-6048604','hstrang1b','Tucky','Hammad','Strang','2011-06-07 02:45:51','2013-10-05 06:21:30','[email protected]')
INSERT INTO #Employees VALUES('55-0900438','hmaccleay1c','Winny','Harmon','MacCleay','1997-02-12 01:56:29','2014-11-22 08:01:34','[email protected]')
INSERT INTO #Employees VALUES('12-7876247','apaty1d','Heidi','Andris','Paty','2006-07-31 02:42:55','2001-06-10 05:38:51','[email protected]')



-- SELECT EVERYTHING BEFORE WE MANGLE DATA --
SELECT *
FROM #Employees


-- MANGLE THE DATA --
UPDATE
     E
SET
     E.EmployeeNumber = (SELECT TOP 1 EmployeeNumber FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.UserName = (SELECT TOP 1 UserName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.FirstName = (SELECT TOP 1 FirstName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.LastName = (SELECT TOP 1 LastName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.MiddleName = (SELECT TOP 1 MiddleName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.HireDate = (SELECT TOP 1 HireDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.BirthDate = (SELECT TOP 1 BirthDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.Email = (SELECT TOP 1 Email FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())

FROM #Employees E

-- SELECT MANGLED DATA --
SELECT *
FROM #Employees

The important part of that whole example is the way we're updating and getting a random value:

UPDATE
     E
SET
     E.EmployeeNumber = (SELECT TOP 1 EmployeeNumber FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.UserName = (SELECT TOP 1 UserName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.FirstName = (SELECT TOP 1 FirstName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.LastName = (SELECT TOP 1 LastName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.MiddleName = (SELECT TOP 1 MiddleName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.HireDate = (SELECT TOP 1 HireDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.BirthDate = (SELECT TOP 1 BirthDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.Email = (SELECT TOP 1 Email FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())

FROM #Employees E

Notice the WHERE in the subselect:

WHERE E.Id = E.Id 

That is merely a trick to force the SQL Server to re-execute the subselect each time. If we don't have that WHERE clause in the subselect we'll get the same row/value because the subselect will only execute once.

You need to consider unique constraints on any of those columns you're "randomizing". It's likely that you're going to have issues with duplicates. In the example above, that could happen against a column like UserName.

Also, depending on your DDL, foreign key constraints could also become problematic when taking this approach.

Lastly, this is far from perfect and if you have super sensitive data like health care information, you should use a certified tool/approach.

Upvotes: 4

Christiaan Westerbeek
Christiaan Westerbeek

Reputation: 11137

I combined the answers that I find above in a single query that randomizes each column again ending up with completely randomized records

UPDATE MyTable SET
  columnA = columnA.newValue,
  columnB = columnB.newValue,
  -- Optionally, for maintaining a group of values like street, zip, city in an address
  columnC = columnGroup.columnC,
  columnD = columnGroup.columnD,
  columnE = columnGroup.columnE
FROM MyTable
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, id FROM MyTable
) AS PKrows ON MyTable.id = PKrows.id
-- repeat the following JOIN for each column you want to randomize
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnA AS newValue FROM MyTable
) AS columnA ON PKrows.rn = columnA.rn
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnB AS newValue FROM MyTable
) AS columnB ON PKrows.rn = columnB.rn

-- Optionally, if you want to maintain a group of values spread out over several columns
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnC, columnD, columnE FROM MyTable
) AS columnGroup ON PKrows.rn = columnGroup.rn

This query took 8 seconds on a 10K rows table shuffling 8 columns on a Windows 2008 R2 machine with 16GB of memory with 4 XEON cores @ 2.93GHz

Upvotes: 3

Mitch Wheat
Mitch Wheat

Reputation: 300559

In order to update rows, there will be significant processsing time (CPU + I/O) from the updates.

Have you measured the relative expense of randomising the rows versus performing the updates?

In all you need to do is pick random rows, here's an efficient method to pick a random sample of rows (in this case 1% of the rows)

SELECT * FROM myTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), pkID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

where pkID is your primary key column.

This post might be of interest:

Upvotes: 6

gbn
gbn

Reputation: 432271

To shuffle data in 10 columns so that the 10 values per row are replaced with other values from other rows will be expensive.

You have to read 2 million rows 10 times.

The SELECT would be

SELECT
    FirstName, LastName, VIN, ...
FROM
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...

I also wouldn't update, I'd create a new table

SELECT
    FirstName, LastName, VIN, ...
INTO
    StagingTable
FROM
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...

Then add keys etc, drop the old table, rename it. Or use a SYNONYM to point to the new table

If you want to update, then I'd do it like this. or break it up into 10 updates .

UPDATE
   M
SET
   Firstname = FirstName.FirstName,
   LastName = LastName.LastName,
   ...
FROM
    MyTable M
    JOIN 
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName ON 1=1
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...

Upvotes: 2

Related Questions