Alex Gordon
Alex Gordon

Reputation: 60811

creating a function or sproc to replace real data with pseudo random data

I need to scrub data out of tables that can be classified as PHI.

Let's say the real data looks like this:

+-------------------+-----------+------+
|       Name        | Condition | Code |
+-------------------+-----------+------+
| Alex              | A         |   12 |
| Laura             | BB        |   32 |
| Bob               | AAA       |   55 |
| Gary              | C         |   32 |
+-------------------+-----------+------+

I will need to replace my Name column with the following fake names:

Names_Reference_Table:

+-------------+----+
|    Name     | ID |
+-------------+----+
| JOHN        |  1 |
| ROBERT      |  2 |
| MICHAEL     |  3 |
| WILLIAM     |  4 |
| DAVID       |  5 |
| RICHARD     |  6 |
| CHARLES     |  7 |
| JOSEPH      |  8 |
| THOMAS      |  9 |
| CHRISTOPHER | 10 |
| DANIEL      | 11 |
| PAUL        | 12 |
| MARK        | 13 |
| DONALD      | 14 |
| GEORGE      | 15 |
+-------------+----+

Though I have 10,000,000 real data records, I have only 10,000 fake data records. I am assuming that every fake name will be repeated 1000 times when replacing the real data.

The trick here is that this needs to be recreatable. After googling for hours, I did find some good sources, and I've created some kind of implementation that is deterministic, assuming that the seed does not change:

SELECT FLOOR(RAND(@seed * LOG([p].[patientid]) * 1000) * ( @count_names - 1 ) + 1)

As you can see from above, when you pass the same seed to RAND(), it will return the same value. However, this implementation is brittle.

Has anyone guidance on how to generate this deterministic behavior?

Upvotes: 1

Views: 136

Answers (2)

sam
sam

Reputation: 1304

If I understood it correctly actual names are more than dummy names. So you want to replace the actual names with dummy names by repeating dummy names by 1000 against the real name. This can be achieved by using Ntile function as follows -

declare @i int = (select max(id) from reference_table)
select b.name,a.condition,a.code from
(
select name,condition,code, Ntile(@i) over (order by code) as RefId
from actualTable
) as a
Left join Refrence_table as b
on b.id = a.refid

Let me know if this helps

Upvotes: 1

innomatics
innomatics

Reputation: 369

You could use the checksum or some other hashing function e.g.

SELECT Name 
FROM Names_Reference_Table
WHERE ID = ABS(CHECKSUM('John')) % 10000

Assuming you have 10,000 fake names

Upvotes: 1

Related Questions