Reputation: 193

Obfuscate / Mask / Scramble personal information

I'm looking for a homegrown way to scramble production data for use in development and test. I've built a couple of scripts that make random social security numbers, shift birth dates, scramble emails, etc. But I've hit a wall trying to scramble customer names. I want to keep real names so we can still use our searches, so random letter generation is out. What I have tried so far is building a temp table of all last names in the table, then updating the customer table with a random selection from the temp table. Like this:

DECLARE @Names TABLE (Id int IDENTITY(1,1),[Name] varchar(100))

/* Scramble the last names (randomly pick another last name) */
INSERT @Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName=(SELECT [Name] FROM @Names WHERE ROWID=Id)

This worked well in test, but it completely bogs down when dealing with larger amounts of data (>20 minutes for 40K rows).

All of that to ask, how would you scramble customer names while keeping real names and the weight of the production data?

UPDATE: Never fails, you try to put all the information in the post, and you forget something important. This data will also be used in our sales & demo environments which are publicly available. Some of the answers are what I am attempting to do, to 'switch' the names, but my question is literally, how to code in T-SQL?

Upvotes: 14

Answers (12)

mnemotronic

Reputation: 1026

I use a method that changes characters in the name to other characters that are in the same "range" of usage frequency in English names. Apparently, the distribution of characters in names is different from that of normal conversational English. For example, "x" and "z" each occur 0.045% of the time, so they get swapped. At the other extreme, "w" is used 5.5% of the time, "s" 6.686% and "t" 15.978%. I change "s" to "w", "t" to "s" and "w" to "t". I keep the vowels "aeio" in a separate group so that a vowel is only replaced by another vowel. Similarly, "q", "u" and "y" are not replaced at all. My grouping and decisions are totally subjective.

I ended up with 7 different "groups" of 2-5 characters, based mostly on frequency. Characters within each group are swapped with other characters in that same group.

The net result is names that kinda look like they might be names, but from "not around here".

Original name     Morphed name
Loren             Nimag
Juanita           Kuogewso
Tennyson          Saggywig
David             Mijsm
Julie             Kunewa

Here's the SQL I use, which includes a "TitleCase" function. There are 2 different versions of the "morphed" name based on different frequencies of letters I found on the web.

--    from     https://stackoverflow.com/a/28712621

-- Convert and return param as Title Case

CREATE FUNCTION [dbo].[fnConvert_TitleCase] (@InputString VARCHAR(4000))
RETURNS VARCHAR(4000)
AS
BEGIN
   DECLARE @Index INT
   DECLARE @Char CHAR(1)
   DECLARE @OutputString VARCHAR(4000)   -- same length as the input so long values are not truncated

   -- Lower-case everything, then upper-case the first character
   SET @OutputString = LOWER(@InputString)
   SET @Index = 2
   SET @OutputString = STUFF(@OutputString, 1, 1, UPPER(SUBSTRING(@InputString, 1, 1)))

   WHILE @Index <= LEN(@InputString)
   BEGIN
      SET @Char = SUBSTRING(@InputString, @Index, 1)
      -- Upper-case the character that follows a separator, except the "s" of a possessive ('s)
      IF @Char IN (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&', '''', '(', '{', '[', '@')
         IF @Index + 1 <= LEN(@InputString)
         BEGIN
            IF @Char != '''' OR UPPER(SUBSTRING(@InputString, @Index + 1, 1)) != 'S'
               SET @OutputString = STUFF(@OutputString, @Index + 1, 1, UPPER(SUBSTRING(@InputString, @Index + 1, 1)))
         END
      SET @Index = @Index + 1
   END

   RETURN ISNULL(@OutputString, '')
END
GO

--    00.045 x 0.045%
--    00.045 z 0.045%
--
--    Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z')
--
--    00.456 k 0.456%
--    00.511 j 0.511%
--    00.824 v 0.824%
--    kjv
--    Replace(Replace(Replace(Replace(TS_NAME,'k','#'),'j','k'),'v','j'),'#','v')
--
--    01.642 g 1.642%
--    02.284 n 2.284%
--    02.415 l 2.415%
--    gnl
--    Replace(Replace(Replace(Replace(TS_NAME,'g','#'),'n','g'),'l','n'),'#','l')
--
--    02.826 r 2.826%
--    03.174 d 3.174%
--    03.826 m 3.826%
--    rdm
--    Replace(Replace(Replace(Replace(TS_NAME,'r','#'),'d','r'),'m','d'),'#','m')
--
--    04.027 f 4.027%
--    04.200 h 4.200%
--    04.319 p 4.319%
--    04.434 b 4.434%
--    05.238 c 5.238%
--    fhpbc
--    Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c')
--
--    05.497 w 5.497%
--    06.686 s 6.686%
--    15.978 t 15.978%
--    wst
--    Replace(Replace(Replace(Replace(TS_NAME,'w','#'),'s','w'),'t','s'),'#','t')
--
--
--    02.799 e 2.799%
--    07.294 i 7.294%
--    07.631 o 7.631%
--    11.682 a 11.682%
--    eioa
--    Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')
--
--    -- dont replace
--    00.222 q 0.222%
--    00.763 y 0.763%
--    01.183 u 1.183%

-- Obfuscate a name
Select
   ts_id,
   Cast(ts_name as varchar(42)) as [Original Name],

   Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z'),'k','#'),'j','k'),'v','j'),'#','v'),'g','#'),'n','g'),'l','n'),'#','l'),'r','#'),'d','r'),'m','d'),'#','m'),'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c'),'w','#'),'s','w'),'t','s'),'#','t'),'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')) as VarChar(42)) As [morphed name] ,
   Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','t'),'~','e'),'t','~'),'a','o'),'~','a'),'o','~'),'i','n'),'~','i'),'n','~'),'s','h'),'~','s'),'h','r'),'r','~'),'d','l'),'~','d'),'l','~'),'m','w'),'~','m'),'w','f'),'f','~'),'g','y'),'~','g'),'y','p'),'p','~'),'b','v'),'~','b'),'v','k'),'k','~'),'x','~'),'j','x'),'~','j')) as VarChar(42)) As [morphed name2]

From
   ts_users
;

Upvotes: 3

elifiner

Reputation: 7575

Encountered the same problem myself and figured out an alternative solution that may work for others.

The idea is to use MD5 on the name and then take the last 3 hex digits of it to map into a table of names. You can do this separately for first name and last name.

3 hex digits represent decimals from 0 to 4095, so we need a list of 4096 first names and 4096 last names.

So conv(substr(md5(first_name), -3),16,10) (in MySQL syntax) would be an index from 0 to 4095 that could be joined with a table that holds 4096 first names. The same concept could be applied to last names.

Using MD5 (as opposed to a random number) guarantees a name in the original data will always be mapped to the same name in the test data.

You can get a list of names here:

https://gist.github.com/elifiner/cc90fdd387449158829515782936a9a4
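
Since the question asks for T-SQL, here is a minimal sketch of the same idea using HASHBYTES. It assumes a lookup table named FakeLastNames with columns Idx (0 to 4095) and Name; that table and its column names are hypothetical, and Customer.LastName is taken from the question. The last two bytes of the MD5 hash are converted to an integer and masked to 12 bits, which corresponds to the last 3 hex digits.

```sql
-- Assumption: FakeLastNames(Idx int PRIMARY KEY, Name varchar(100)) is a
-- hypothetical lookup table holding 4096 names, with Idx running from 0 to 4095.
UPDATE c
SET    c.LastName = f.Name
FROM   Customer AS c
JOIN   FakeLastNames AS f
       -- bytes 15-16 of the 16-byte MD5 hash, masked to the low 12 bits (0..4095)
  ON   f.Idx = CONVERT(int, SUBSTRING(HASHBYTES('MD5', c.LastName), 15, 2)) & 4095;
```

Rows whose LastName is NULL are left unchanged, because HASHBYTES returns NULL for them and the join finds no match.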

Upvotes: 1

S3S

Reputation: 25112

Here's a way using ROT47, which is reversible, and another which is random. You can add a PK to either to link back to the "unscrambled" versions.

declare @table table (ID int, PLAIN_TEXT nvarchar(4000))
insert into @table
values
(1,N'Some Dudes name'),
(2,N'Another Person Name'),
(3,N'Yet Another Name')

--split your string into a column, and compute the decimal value (N) 
if object_id('tempdb..#staging') is not null drop table #staging
select 
    substring(a.b, v.number+1, 1) as Val
    ,ascii(substring(a.b, v.number+1, 1)) as N
    --,dense_rank() over (order by b) as RN
    ,a.ID
into #staging
from (select PLAIN_TEXT b, ID FROM @table) a
    inner join
         master..spt_values v on v.number < len(a.b)
where v.type = 'P' 

--select * from #staging


--create a fast tally table of numbers to be used to build the ROT-47 table.

;WITH
    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS 
    (
        SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )



--Here we put it all together with stuff and FOR XML
select 
    PLAIN_TEXT
    ,ENCRYPTED_TEXT =
        stuff((
        select
            --s.Val
            --,s.N
            e.ENCRYPTED_TEXT
        from #staging s
        left join(
        select 
            N as DECIMAL_VALUE
            ,char(N) as ASCII_VALUE
            ,case 
                when 47 + N <= 126 then char(47 + N)
                when 47 + N > 126 then char(N-47)
            end as ENCRYPTED_TEXT
        from cteTally
        where N between 33 and 126) e on e.DECIMAL_VALUE = s.N
        where s.ID = t.ID
        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t


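--or if you want really random
--(a CTE is only visible to the statement that immediately follows it, so the tally is declared again here)
;WITH
    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS 
    (
        SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )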
select 
    PLAIN_TEXT
    ,ENCRYPTED_TEXT =
        stuff((
        select
            --s.Val
            --,s.N
            e.ENCRYPTED_TEXT
         from #staging s
        left join(
        select 
            N as DECIMAL_VALUE
            ,char(N) as ASCII_VALUE
            ,char((select ROUND(((122 - N -1) * RAND() + N), 0))) as ENCRYPTED_TEXT
        from cteTally
        where (N between 65 and 122) and N not in (91,92,93,94,95,96)) e on e.DECIMAL_VALUE = s.N
        where s.ID = t.ID
        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t

Upvotes: 1

Josh

Reputation: 4438

Another site to generate shaped fake data sets, with an option for T-SQL output: https://mockaroo.com/

Upvotes: 2

AUR

Reputation: 633

The following approach worked for us. Let's say we have 2 tables, Customers and Products:

CREATE FUNCTION [dbo].[GenerateDummyValues]
(
    @dataType varchar(100),
    @currentValue varchar(4000)=NULL
)
RETURNS varchar(4000)
AS
BEGIN
IF @dataType = 'int'
    BEGIN
        Return '0'
    END
ELSE IF @dataType = 'varchar' OR @dataType = 'nvarchar' OR @dataType = 'char' OR @dataType = 'nchar'
    BEGIN
        Return 'AAAA'
    END
ELSE IF @dataType = 'datetime'
    BEGIN
        Return Convert(varchar(2000),GetDate())
    END
-- you can add more checks, add complicated logic etc
Return 'XXX'
END

The above function generates dummy values based on the data type passed in.

Now, for each column of each table whose name does not end in "id", use the following query to generate further queries to manipulate the data:

select 'select ''update '' + TABLE_NAME + '' set '' + COLUMN_NAME + '' = '' +  '''''''' + dbo.GenerateDummyValues( Data_type,'''') + '''''' where id = '' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, ' + table_name + ' where RIGHT(LOWER(COLUMN_NAME),2) <> ''id'' and TABLE_NAME = '''+ table_name + '''' + ';' from  INFORMATION_SCHEMA.TABLES;

When you execute the above query, it will generate, for each table, a query that in turn produces update statements for each column of that table, for example:

select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' +  '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Customers where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Customers';
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' +  '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Products where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Products';

Now, when you execute the above queries, you will get the final update queries that will update the data in your tables.

You can execute this on any SQL Server database. No matter how many tables you have, it will generate queries for you that can then be executed.

Hope this helps.

Upvotes: 1

Tom Powers

Reputation:

Use a temporary table instead and the query is very fast. I just ran it on 60K rows in 4 seconds. I'll be using this one going forward.

CREATE TABLE #Names
(Id int IDENTITY(1,1), [Name] varchar(100))

/* Scramble the last names (randomly pick another last name) */
INSERT #Names
  SELECT LastName
  FROM Customer
  ORDER BY NEWID();

WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID]
SET LastName = (SELECT [Name] FROM #Names WHERE ROWID = Id)

DROP TABLE #Names
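
The same shuffle can probably also be written as one set-based statement with no intermediate table. This is only a sketch, assuming the Customer table and LastName column from the question; the CTE names Shuffled and Target are made up for illustration. Each CTE numbers the rows in its own random order, and the join pairs row n of one ordering with row n of the other.

```sql
WITH Shuffled AS
    (SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM Customer),
Target AS
    (SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM Customer)
-- Each target row receives the last name sitting at the same position
-- in an independently randomized ordering of the same table.
UPDATE Target
SET    LastName = Shuffled.LastName
FROM   Target
JOIN   Shuffled ON Shuffled.rn = Target.rn;
```

As with the temp-table version, every original last name is used exactly once, so the overall distribution of names is preserved, and some rows may happen to keep their own name.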

Upvotes: 1

Nick Perkins

Reputation: 8294

I am working on this at my company right now -- and it turns out to be a very tricky thing. You want to have names that are realistic, but must not reveal any real personal info.

My approach has been to first create a randomized "mapping" of last names to other last names, then use that mapping to change all last names. This is good if you have duplicate name records. Suppose you have 2 "John Smith" records that both represent the same real person. If you changed one record to "John Adams" and the other to "John Best", then your one "person" now has 2 different names! With a mapping, all occurrences of "Smith" get changed to "Jones", and so duplicates ( or even family members ) still end up with the same last name, keeping the data more "realistic".
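
A minimal T-SQL sketch of such a mapping, assuming the Customer table and LastName column from the question (the temp table #NameMap and its column names are hypothetical). The distinct last names are numbered twice in independent random orders, and matching numbers pair an old name with its replacement, so every "Smith" receives the same new name.

```sql
-- Build a one-to-one random mapping between the distinct last names.
SELECT o.LastName AS OldName, n.LastName AS NewName
INTO   #NameMap
FROM  (SELECT d.LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
       FROM (SELECT DISTINCT LastName FROM Customer WHERE LastName IS NOT NULL) AS d) AS o
JOIN  (SELECT d.LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
       FROM (SELECT DISTINCT LastName FROM Customer WHERE LastName IS NOT NULL) AS d) AS n
  ON   n.rn = o.rn;

-- Apply the mapping: all rows sharing an old name get the same new name.
UPDATE c
SET    c.LastName = m.NewName
FROM   Customer AS c
JOIN   #NameMap AS m ON m.OldName = c.LastName;

DROP TABLE #NameMap;
```

Because the pairing is random, a name can occasionally map to itself, and the new names are still real names from the production data, so this alone may not be enough for a publicly visible environment.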

I will also have to scramble the addresses, phone numbers, bank account numbers, etc., and I am not sure how I will approach those. Keeping the data "realistic" while scrambling is certainly a deep topic. This must have been done many times by many companies -- who has done this before? What did you learn?

Upvotes: 0

Milan Babuškov

Reputation: 61138

When doing something like that, I usually write a small program that first loads a lot of names and surnames into two arrays, and then just updates the database using a random name/surname from the arrays. It works really fast even for very big datasets (200.000+ records).

Upvotes: 2

Jeff

Reputation: 2871

Frankly, I'm not sure why this is needed. Your dev/test environments should be private, behind your firewall, and not accessible from the web.

Your developers should be trusted, and you have legal recourse against them if they fail to live up to your trust.

I think the real question should be "Should I scramble the data?", and the answer is (in my mind) 'no'.

If you're sending it offsite for some reason, or you have to have your environments web-accessible, or if you're paranoid, I would implement a random switch. Rather than build a temp table, run switches between each location and a random row in the table, swapping one piece of data at a time.

The end result will be a table with all the same data, but with it randomly reorganized. It should also be faster than your temp table, I believe.

It should be simple enough to implement the Fisher-Yates Shuffle in SQL...or at least in a console app that reads the db and writes to the target.

Edit (2): Off-the-cuff answer in T-SQL:

declare @randomId int, @currentId int   -- both are set by the surrounding loop
declare @name varchar(50)
set @name = (SELECT lastName from person where personID = @randomId)   -- random id number
Update person set lastname = @name WHERE personID = @currentId         -- person id of current row

Wrap this in a loop, and follow the guidelines of Fisher-Yates for modifying the random value constraints, and you'll be set.
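
One way the loop could look is sketched below. It assumes the person table with personID and lastName columns from the snippet above; the work table #Shuffle and the variable names are hypothetical. The names are copied into a work table with dense positions 1..n, shuffled there with a Fisher-Yates loop (walk i from n down to 2, swap position i with a random position j between 1 and i), and then written back.

```sql
-- Work on a copy with dense positions 1..n so that every random position exists.
SELECT ROW_NUMBER() OVER (ORDER BY personID) AS pos, personID, lastName
INTO   #Shuffle
FROM   person;

CREATE UNIQUE CLUSTERED INDEX IX_Shuffle_pos ON #Shuffle (pos);

DECLARE @i int, @j int, @a varchar(50), @b varchar(50);
SELECT @i = COUNT(*) FROM #Shuffle;

WHILE @i > 1
BEGIN
    -- Pick j from 1..i (modulo first, then ABS, to avoid overflow on the minimum int).
    SET @j = 1 + ABS(CHECKSUM(NEWID()) % @i);

    -- Swap the names at positions i and j.
    SELECT @a = lastName FROM #Shuffle WHERE pos = @i;
    SELECT @b = lastName FROM #Shuffle WHERE pos = @j;
    UPDATE #Shuffle SET lastName = @b WHERE pos = @i;
    UPDATE #Shuffle SET lastName = @a WHERE pos = @j;

    SET @i = @i - 1;
END;

-- Write the shuffled names back to the real table.
UPDATE p
SET    p.lastName = s.lastName
FROM   person AS p
JOIN   #Shuffle AS s ON s.personID = p.personID;

DROP TABLE #Shuffle;
```

This runs one iteration per row, so on large tables it is likely to be noticeably slower than a set-based shuffle; the random pick is also only approximately uniform.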

Upvotes: -2

Ryan

Reputation: 9928

Why not just use some sort of Random Name Generator?

Upvotes: 0

Peter Hoffmann

Reputation: 58664

I use generatedata. It is an open source PHP script which can generate all sorts of dummy data.

Upvotes: 6

warren

Reputation: 33445

A very simple solution would be to ROT13 the text.

A better question may be why you feel the need to scramble the data. If you have an encryption key, you could also consider running the text through DES or AES or similar. Those would have potential performance issues, however.

Upvotes: 2
