creating a function or sproc to replace real data with pseudo random data

Question

I need to scrub data out of tables that can be classified as PHI.

Let's say the real data looks like this:

+-------------------+-----------+------+
|       Name        | Condition | Code |
+-------------------+-----------+------+
| Alex              | A         |   12 |
| Laura             | BB        |   32 |
| Bob               | AAA       |   55 |
| Gary              | C         |   32 |
+-------------------+-----------+------+

I will need to replace my Name column with the following fake names:

Names_Reference_Table:

+-------------+----+
|    Name     | ID |
+-------------+----+
| JOHN        |  1 |
| ROBERT      |  2 |
| MICHAEL     |  3 |
| WILLIAM     |  4 |
| DAVID       |  5 |
| RICHARD     |  6 |
| CHARLES     |  7 |
| JOSEPH      |  8 |
| THOMAS      |  9 |
| CHRISTOPHER | 10 |
| DANIEL      | 11 |
| PAUL        | 12 |
| MARK        | 13 |
| DONALD      | 14 |
| GEORGE      | 15 |
+-------------+----+

Though I have 10,000,000 real data records, I have only 10,000 fake data records. I am assuming that every fake name will be repeated 1000 times when replacing the real data.

The trick here is that this needs to be recreatable. After googling for hours, I did find some good sources, and I've created some kind of implementation that is deterministic, assuming that the seed does not change:

SELECT FLOOR(RAND(@seed * LOG([p].[patientid]) * 1000) * ( @count_names - 1 ) + 1)

As you can see from above, when you pass the same seed to RAND(), it will return the same value. However, this implementation is brittle.

Has anyone guidance on how to generate this deterministic behavior?

innomatics · Accepted Answer

You could use the checksum or some other hashing function e.g.

SELECT Name 
FROM Names_Reference_Table
WHERE ID = ABS(CHECKSUM('John')) % 10000

Assuming you have 10,000 fake names

creating a function or sproc to replace real data with pseudo random data

Answers (2)

Related Questions