webworm
webworm

Reputation: 11019

Masking sensitive data in tables while maintaining relationships

I have a series of tables in SQL Server where I would like to alter sensitive data such as patient names and social security numbers without deleting and then reinserting brand new rows. I simply want to alter the "identifying" information.

I purchased Red-Gate SQL Data Generator hoping it would do the job but it doesn't. It insists on deleting existing rows and then inserting new obfuscated data. I want to retain the relationships between tables (i.e. keep primary and foreign keys). Is there a way to alter/update the existing data while still maintaining readability for testing and development?

Upvotes: 1

Views: 1207

Answers (1)

AHiggins
AHiggins

Reputation: 7219

I experimented with the idea of updating the table using a self-join on a random number, and came up with the following:

DECLARE @PatientData TABLE (ID INT IDENTITY(1,1), PatientName VARCHAR(100), PatientBill DECIMAL(10,2))

INSERT INTO @PatientData (PatientName, PatientBill) -- this will be our unchanging data
SELECT 'Andrew', 100.00
UNION
SELECT 'James', 456.00
UNION
SELECT 'John', 782.00
UNION
SELECT 'Jill', 456.00
UNION
SELECT 'Jack', 157.00
UNION
SELECT 'Julie', 9.00
UNION
SELECT 'Jesse', 75.00
UNION
SELECT 'Jackson', 4563.00
UNION
SELECT 'Johnson', 7783.00
UNION
SELECT 'Jeremy', 9841.00
UNION
SELECT 'Jeff', 4565.00
UNION
SELECT 'Jamie', 75387.00
UNION
SELECT 'Jillian', 86786.00


DECLARE @ScrambledData TABLE (ID INT, PatientName VARCHAR(100), PatientBill DECIMAL(10,2)) -- this one will get scrambled
INSERT INTO @ScrambledData (ID, PatientName, PatientBill)
SELECT ID, PatientName, PatientBill
FROM @PatientData 




UPDATE sd 
    SET sd.PatientName = pd.PatientName
FROM 
    @ScrambledData sd
     INNER JOIN 
      (
        SELECT 
            ROW_NUMBER() OVER (ORDER BY NEWID()) ID,
            PatientName, 
            PatientBill 
        FROM @PatientData
      ) pd ON 
        sd.ID = pd.ID 

UPDATE sd 
    SET sd.PatientBill = pd.PatientBill
FROM 
    @ScrambledData sd
     INNER JOIN 
      (
        SELECT 
            ROW_NUMBER() OVER (ORDER BY NEWID()) ID,
            PatientName, 
            PatientBill 
        FROM @PatientData
      ) pd ON 
        sd.ID = pd.ID 

SELECT *
FROM 
    @PatientData pd
     INNER JOIN 
    @ScrambledData sd ON 
        pd.ID = sd.ID 

In your example, of course, you'd be doing a self-JOIN, but I wanted to use both tables so that I could confirm that it's working correctly. I intentionally issued two UPDATE statements, so you'd be less likely to have a patient name associated with their actual bill, but you could do the same thing with a single UPDATE statement doing a few different JOINs.

You could also run this statement a couple of times, to double-hash the data: I think that's a good idea, though I'm not sure mathematically what the odds are against staying the same over multiple executions (if I'm thinking correctly, a 100-row dataset has a 1% chance of staying the same over a single execution, then a 1:1,000 chance of staying the same over two executions. However, you also run the chance of going back to the record it used to be, which I think is also 1:1,000? Anyone with a better grasp of statistical probability and a higher caffeine level know a solid answer for this?)

Upvotes: 1

Related Questions