Reputation: 91
I have a table with data which I have to randomize. By randomizing, I mean use data from random row to update another row in that same column. Problem is that the table itself is big (more than 2 000 000 rows).
I wrote a piece of code which uses while loop, but it's going slow.
Does anyone have any suggestion about more efficient way of achieving randomization?
Upvotes: 5
Views: 27057
Reputation: 162
You just order by random in the select statement
Fast and Simple
SELECT * FROM TableName
order by NEWID()
SELECT * FROM TableName
ORDER BY DBMS_RANDOM.VALUE
SELECT * FROM TableName
ORDER BY RAND()
SELECT * FROM TableName
ORDER BY random()
Randomizing the SQL query result set is a common requirement, so it’s useful to know what SQL function you can use in the ORDER BY clause depending on the underlying relational database type.
Upvotes: 2
Reputation: 4738
ORIGINAL ANSWER FOR BREVITY:
Based on Mitch Wheats answer linking to this article on scrambline data you can do something like this to scramble a bunch of fields, you're not just limited to the IDs:
;WITH Randomize AS
(
SELECT ROW_NUMBER() OVER (ORDER BY [UserID]) AS orig_rownum,
ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum,
*
FROM [UserTable]
)
UPDATE T1
SET [UserID] = T2.[UserID]
,[FirstName] = T2.[FirstName]
,[LastName] = T2.[LastName]
,[AddressLine1] = T2.[AddressLine1]
,[AddressLine2] = T2.[AddressLine2]
,[AddressLine3] = T2.[AddressLine3]
,[City] = T2.[City]
,[State] = T2.[State]
,[Pincode] = T2.[Pincode]
,[PhoneNumber] = T2.[PhoneNumber]
,[MobileNumber] = T2.[MobileNumber]
,[Email] = T2.[Email]
,[Status] = T2.[Status]
FROM Randomize T1
join Randomize T2 on T1.orig_rownum = T2.new_rownum
;
So you aren't just limited to doing this as the article shows:
;WITH Randomize AS
(
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS orig_rownum,
ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum,
*
FROM [MyTable]
)
UPDATE T1 SET Id = T2.Id
FROM Randomize T1
join Randomize T2 on T1.orig_rownum = T2.new_rownum
;
The danger to this approach is the amount of data you are tweaking. Using a CTE crams all that stuff into memory, so while I found that this is fairly speedy ( 19 seconds for a 500k row table ). You will want to be careful if you have a table that has millions of records. You should consider how much data is actually needed or is a good population sample, for testing and development.
As I mention in my comment below, I recommend looking into a tool that does this and buying a license. For example: https://www.red-gate.com/products/oracle-development/data-masker-for-oracle/
Sometimes we're budget-constrained and in that situation, this is how I "randomize" data. Below is an example, followed by an explanation/caveat, run on MSSQL 2017:
-- DROP/CREATE FAKEY TEMP TABLE --
IF OBJECT_ID('tempdb..#Employees') IS NOT NULL
DROP TABLE #Employees
CREATE TABLE #Employees (
Id INT IDENTITY PRIMARY KEY
,EmployeeNumber VARCHAR(20)
,UserName VARCHAR(384)
,FirstName VARCHAR(128)
,MiddleName VARCHAR(128)
,LastName VARCHAR(256)
,HireDate DATETIME2
,BirthDate DATETIME2
,Email VARCHAR(384)
)
/*
==== POPULATE FAKEY TEMP TABLE ====
Generated here: https://www.mockaroo.com/
VimFu turns CSV to INSERT statements.
*/
INSERT INTO #Employees VALUES('29-7477088','cstudman0','Moyna','Cobb','Studman','2004-05-03 22:45:55','1991-09-02 18:14:58','[email protected]')
INSERT INTO #Employees VALUES('39-0211165','cgremane1','Cristabel','Carlye','Gremane','2016-06-24 06:31:06','2006-01-09 21:17:28','[email protected]')
INSERT INTO #Employees VALUES('50-1527461','jandree2','Kaja','Justinian','Andree','2020-04-03 06:48:12','2015-10-21 00:48:05','[email protected]')
INSERT INTO #Employees VALUES('53-4435748','pportman3','Zachery','Prudi','Portman','2018-07-23 20:27:23','2017-10-17 18:38:06','[email protected]')
INSERT INTO #Employees VALUES('51-6508890','lnutbeam4','Bobbee','Lilah','Nutbeam','2017-02-03 00:37:52','2002-08-25 11:16:41','[email protected]')
INSERT INTO #Employees VALUES('85-8633318','gleach5','Jaquith','Glenda','Leach','2019-03-12 19:32:33','2008-07-17 08:42:00','[email protected]')
INSERT INTO #Employees VALUES('54-1590858','cwandrach6','Anna-diane','Conrad','Wandrach','1997-12-10 06:09:28','1994-07-22 08:15:27','[email protected]')
INSERT INTO #Employees VALUES('79-7072949','mrankling7','Zeke','Molli','Rankling','2018-12-20 02:35:47','2020-03-10 22:28:33','[email protected]')
INSERT INTO #Employees VALUES('92-8250045','emebius8','Tabor','Eachelle','Mebius','1994-04-15 09:16:44','1995-02-20 17:17:29','[email protected]')
INSERT INTO #Employees VALUES('87-4813998','olyes9','Serene','Ophelie','Lyes','2002-10-08 23:38:19','1996-08-08 03:08:25','[email protected]')
INSERT INTO #Employees VALUES('31-3555215','ykornasa','Faydra','Yancy','Kornas','2004-02-23 12:32:51','1995-10-24 12:27:54','[email protected]')
INSERT INTO #Employees VALUES('16-6787860','pgoldsbyb','Cristian','Phyllis','Goldsby','2012-02-26 11:55:51','1994-03-04 07:18:46','[email protected]')
INSERT INTO #Employees VALUES('11-2332324','bnavarijoc','Werner','Bartlett','Navarijo','2013-02-18 09:02:08','2013-07-19 19:34:39','[email protected]')
INSERT INTO #Employees VALUES('69-9128351','rprobartd','Wolfgang','Ricardo','Probart','1999-01-16 03:33:25','2014-08-02 06:07:44','[email protected]')
INSERT INTO #Employees VALUES('83-0754916','mmaclachlane','Johannah','Mozelle','MacLachlan','2018-09-28 01:02:17','2004-11-11 06:25:56','[email protected]')
INSERT INTO #Employees VALUES('86-9635344','hortegaf','Giulietta','Hetti','Ortega','2003-12-01 07:03:20','2006-11-27 11:53:03','[email protected]')
INSERT INTO #Employees VALUES('87-5426191','akeatingg','Gilbertine','Anabelle','Keating','1991-09-04 07:24:51','2018-09-27 15:30:29','[email protected]')
INSERT INTO #Employees VALUES('94-4568960','aalsinah','Rhona','Alysia','Alsina','1994-12-19 09:09:01','2010-02-12 14:23:18','[email protected]')
INSERT INTO #Employees VALUES('32-6080729','tgarradi','Bank','Trumann','Garrad','2015-06-07 05:42:30','2002-11-06 08:49:17','[email protected]')
INSERT INTO #Employees VALUES('88-0899323','fghentj','Daphene','Feodor','Ghent','2003-04-19 14:19:10','2006-08-07 02:19:51','[email protected]')
INSERT INTO #Employees VALUES('00-7608833','arosebladek','Ed','Andrea','Roseblade','2019-09-05 00:25:28','2006-10-24 07:12:49','[email protected]')
INSERT INTO #Employees VALUES('49-8817469','achallinl','Mandel','Andonis','Challin','2002-12-25 16:34:16','2003-08-15 15:52:04','[email protected]')
INSERT INTO #Employees VALUES('89-3690501','pgurkom','Jock','Peterus','Gurko','1995-09-18 02:50:40','1992-11-14 05:57:15','[email protected]')
INSERT INTO #Employees VALUES('70-7699938','rklimpn','Florida','Ricky','Klimp','1991-11-10 01:36:31','1997-10-18 06:09:28','[email protected]')
INSERT INTO #Employees VALUES('57-7312244','vrentilllo','Tremaine','Verla','Rentilll','2005-07-06 16:21:15','1993-02-04 00:57:48','[email protected]')
INSERT INTO #Employees VALUES('49-5173804','fcarenp','Carly','Fabiano','Caren','2019-10-19 22:59:51','2006-04-28 05:18:18','[email protected]')
INSERT INTO #Employees VALUES('82-6348827','lwhitewoodq','Gavin','Loren','Whitewood','2019-12-02 22:20:23','1998-07-25 13:15:13','[email protected]')
INSERT INTO #Employees VALUES('15-1749593','veronier','Candace','Vincenty','Eronie','2001-08-21 12:18:37','2000-11-09 00:43:32','[email protected]')
INSERT INTO #Employees VALUES('28-1121210','dharuards','Clim','Dionysus','Haruard','1996-02-19 20:43:40','2013-10-26 16:11:15','[email protected]')
INSERT INTO #Employees VALUES('89-0288115','dstigglest','Katalin','Danit','Stiggles','2007-03-25 01:04:44','2000-09-24 00:22:34','[email protected]')
INSERT INTO #Employees VALUES('74-3532636','tmelroseu','Gregorio','Thoma','Melrose','2010-10-26 12:55:55','2012-10-03 09:51:15','[email protected]')
INSERT INTO #Employees VALUES('22-8027830','hbeazev','Peirce','Hettie','Beaze','2001-07-30 16:36:55','1992-07-17 13:21:09','[email protected]')
INSERT INTO #Employees VALUES('62-2571142','etoonw','Carlita','Emmett','Toon','2015-07-13 02:44:51','2001-09-18 07:06:14','[email protected]')
INSERT INTO #Employees VALUES('99-2911468','ayarntonx','Caye','Antonin','Yarnton','2019-02-14 23:26:54','2020-04-08 12:42:46','[email protected]')
INSERT INTO #Employees VALUES('33-8734931','tmationy','Coreen','Tomi','Mation','2011-06-18 09:49:40','2014-05-06 01:34:35','[email protected]')
INSERT INTO #Employees VALUES('89-1230660','gordeltz','Sammie','Granny','Ordelt','1996-05-10 03:58:30','1991-05-18 11:19:27','[email protected]')
INSERT INTO #Employees VALUES('80-9988795','ilaverock10','Selina','Isabelle','Laverock','2015-06-29 12:38:49','2002-07-27 00:43:15','[email protected]')
INSERT INTO #Employees VALUES('61-0387718','sbraithwaite11','Riordan','Sonnnie','Braithwaite','2000-06-08 03:36:23','1998-12-29 14:13:12','[email protected]')
INSERT INTO #Employees VALUES('75-4850210','bolagen12','Lurette','Bradley','O''Lagen','1990-01-05 10:55:13','1995-04-26 17:47:18','[email protected]')
INSERT INTO #Employees VALUES('40-2385038','riron13','Mignon','Ronny','Iron','2004-06-05 19:46:00','2006-05-02 15:17:31','[email protected]')
INSERT INTO #Employees VALUES('97-5320734','kkibbe14','Manon','Kyrstin','Kibbe','1993-02-25 16:49:38','2006-09-03 04:20:53','[email protected]')
INSERT INTO #Employees VALUES('54-6524877','sstorms15','Shaughn','Sandra','Storms','2013-09-30 17:52:42','1996-10-28 03:54:59','[email protected]')
INSERT INTO #Employees VALUES('11-8287102','tkamen16','Allyn','Tim','Kamen','1991-06-16 08:46:49','2020-04-11 01:15:58','[email protected]')
INSERT INTO #Employees VALUES('38-6081847','ralpe17','Cathrin','Rubetta','Alpe','2017-03-26 06:50:37','1993-05-30 05:11:09','[email protected]')
INSERT INTO #Employees VALUES('79-8160581','iparsonson18','Gael','Isabelita','Parsonson','2015-06-28 22:51:35','2001-08-26 14:07:54','[email protected]')
INSERT INTO #Employees VALUES('56-6871425','movendon19','Nikki','Margalit','Ovendon','2019-05-13 15:34:38','2013-10-02 09:38:17','[email protected]')
INSERT INTO #Employees VALUES('86-6665602','nfulep1a','Odessa','Nessa','Fulep','2010-04-21 14:38:58','2002-07-24 13:45:01','[email protected]')
INSERT INTO #Employees VALUES('41-6048604','hstrang1b','Tucky','Hammad','Strang','2011-06-07 02:45:51','2013-10-05 06:21:30','[email protected]')
INSERT INTO #Employees VALUES('55-0900438','hmaccleay1c','Winny','Harmon','MacCleay','1997-02-12 01:56:29','2014-11-22 08:01:34','[email protected]')
INSERT INTO #Employees VALUES('12-7876247','apaty1d','Heidi','Andris','Paty','2006-07-31 02:42:55','2001-06-10 05:38:51','[email protected]')
-- SELECT EVERYTHING BEFORE WE MANGLE DATA --
SELECT *
FROM #Employees
-- MANGLE THE DATA --
UPDATE
E
SET
E.EmployeeNumber = (SELECT TOP 1 EmployeeNumber FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.UserName = (SELECT TOP 1 UserName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.FirstName = (SELECT TOP 1 FirstName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.LastName = (SELECT TOP 1 LastName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.MiddleName = (SELECT TOP 1 MiddleName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.HireDate = (SELECT TOP 1 HireDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.BirthDate = (SELECT TOP 1 BirthDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.Email = (SELECT TOP 1 Email FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
FROM #Employees E
-- SELECT MANGLED DATA --
SELECT *
FROM #Employees
The important part of that whole example is the way we're updating and getting a random value:
UPDATE
E
SET
E.EmployeeNumber = (SELECT TOP 1 EmployeeNumber FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.UserName = (SELECT TOP 1 UserName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.FirstName = (SELECT TOP 1 FirstName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.LastName = (SELECT TOP 1 LastName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.MiddleName = (SELECT TOP 1 MiddleName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.HireDate = (SELECT TOP 1 HireDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.BirthDate = (SELECT TOP 1 BirthDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
,E.Email = (SELECT TOP 1 Email FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
FROM #Employees E
Notice the WHERE in the subselect:
WHERE E.Id = E.Id
That is merely a trick to force the SQL Server to re-execute the subselect each time. If we don't have that WHERE clause in the subselect we'll get the same row/value because the subselect will only execute once.
You need to consider unique constraints on any of those columns you're "randomizing". It's likely that you're going to have issues with duplicates. In the example above, that could happen against a column like UserName.
Also, depending on your DDL, foreign key constraints could also become problematic when taking this approach.
Lastly, this is far from perfect and if you have super sensitive data like health care information, you should use a certified tool/approach.
Upvotes: 4
Reputation: 11137
I combined the answers that I find above in a single query that randomizes each column again ending up with completely randomized records
UPDATE MyTable SET
columnA = columnA.newValue,
columnB = columnB.newValue,
-- Optionally, for maintaining a group of values like street, zip, city in an address
columnC = columnGroup.columnC,
columnD = columnGroup.columnD,
columnE = columnGroup.columnE
FROM MyTable
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, id FROM MyTable
) AS PKrows ON MyTable.id = PKrows.id
-- repeat the following JOIN for each column you want to randomize
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnA AS newValue FROM MyTable
) AS columnA ON PKrows.rn = columnA.rn
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnB AS newValue FROM MyTable
) AS columnB ON PKrows.rn = columnB.rn
-- Optionally, if you want to maintain a group of values spread out over several columns
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnC, columnD, columnE FROM MyTable
) AS columnGroup ON PKrows.rn = columnGroup.rn
This query took 8 seconds on a 10K rows table shuffling 8 columns on a Windows 2008 R2 machine with 16GB of memory with 4 XEON cores @ 2.93GHz
Upvotes: 3
Reputation: 300559
In order to update rows, there will be significant processsing time (CPU + I/O) from the updates.
Have you measured the relative expense of randomising the rows versus performing the updates?
In all you need to do is pick random rows, here's an efficient method to pick a random sample of rows (in this case 1% of the rows)
SELECT * FROM myTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), pkID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
where pkID
is your primary key column.
This post might be of interest:
Upvotes: 6
Reputation: 432271
To shuffle data in 10 columns so that the 10 values per row are replaced with other values from other rows will be expensive.
You have to read 2 million rows 10 times.
The SELECT would be
SELECT
FirstName, LastName, VIN, ...
FROM
(SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName
JOIN
(SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
JOIN
(SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
JOIN
...
I also wouldn't update, I'd create a new table
SELECT
FirstName, LastName, VIN, ...
INTO
StagingTable
FROM
(SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName
JOIN
(SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
JOIN
(SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
JOIN
...
Then add keys etc, drop the old table, rename it. Or use a SYNONYM to point to the new table
If you want to update, then I'd do it like this. or break it up into 10 updates .
UPDATE
M
SET
Firstname = FirstName.FirstName,
LastName = LastName.LastName,
...
FROM
MyTable M
JOIN
(SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName ON 1=1
JOIN
(SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
JOIN
(SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
JOIN
...
Upvotes: 2