E. van Dongen

Reputation: 99

How to anonymize data without losing duplicates

I need to anonymize data containing client numbers. About half of them are duplicate values, because some clients appear more than once. How can I anonymize them in R so that duplicates are mapped to the same anonymized value?

Thanks in advance!

Upvotes: 3

Views: 696

Answers (1)

Allan Cameron

Reputation: 173813

Suppose your data looks like this:

df <- data.frame(id = c("A", "B", "C", "A", "B", "C"), value = rnorm(6),
                 stringsAsFactors = FALSE)
df
#>   id      value
#> 1  A -0.8238857
#> 2  B -0.1553338
#> 3  C -0.6297834
#> 4  A -0.4616377
#> 5  B  0.1643057
#> 6  C -0.6719061

And your vector of new ID strings (which can be generated randomly; see the note at the end) looks like this:

newIds <- c("newId1", "newId2", "newId3")

Then you should first ensure that your id column is a factor:

df$id <- as.factor(df$id)
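A factor suits this task because it stores each distinct value once, as a level, and keeps only an integer index per row. The short sketch below uses a standalone vector named `ids` (an illustrative name, not part of the data frame above) to show both parts:

```r
# Illustrative vector with repeated client IDs
ids <- factor(c("A", "B", "C", "A", "B", "C"))

# Each distinct ID is stored once, in sorted order
levels(ids)
#> [1] "A" "B" "C"

# Each row only holds an index into the levels
as.integer(ids)
#> [1] 1 2 3 1 2 3
```

Renaming a level therefore renames every row that points to it, which is why duplicates stay consistent.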

Then you should probably store the original client IDs in a lookup table so they can be recovered later:

lookup <- data.frame(key = newIds, value = levels(df$id))
lookup
#>      key value
#> 1 newId1     A
#> 2 newId2     B
#> 3 newId3     C

Now all you need to do is overwrite the factor levels:

levels(df$id) <- newIds

df
#>       id      value
#> 1 newId1 -0.8238857
#> 2 newId2 -0.1553338
#> 3 newId3 -0.6297834
#> 4 newId1 -0.4616377
#> 5 newId2  0.1643057
#> 6 newId3 -0.6719061
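If you later need to recover the original IDs, the lookup table can be applied in reverse with `match()`. This is a minimal sketch that rebuilds the lookup table by hand so it runs on its own; the names `anon` and `original` are illustrative:

```r
# Lookup table in the same shape as above: key = new ID, value = original ID
lookup <- data.frame(key = c("newId1", "newId2", "newId3"),
                     value = c("A", "B", "C"),
                     stringsAsFactors = FALSE)

# Anonymized IDs as they would appear in the data
anon <- c("newId1", "newId2", "newId3", "newId1", "newId2", "newId3")

# Find each anonymized ID's row in the lookup, then take the original value
original <- lookup$value[match(anon, lookup$key)]
original
#> [1] "A" "B" "C" "A" "B" "C"
```

Since the lookup table undoes the anonymization, it should be stored separately from the anonymized data.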

Note: If you want to create random strings for the IDs, you can do this:

sapply(seq_along(levels(df$id)), function(x) paste0(sample(LETTERS, 5), collapse = ""))
#> [1] "TWABF" "YSBUF" "WVQEY"
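The one-liner above does not guarantee that the generated strings are distinct, and two clients sharing a new ID would be merged. A possible variant that keeps drawing until it has enough unique strings is sketched below; `make_random_ids` is a hypothetical helper name, and sampling letters with replacement is an assumption made here, not part of the original snippet:

```r
# Hypothetical helper: draw random strings until n distinct ones exist
make_random_ids <- function(n, len = 5) {
  # Guard against asking for more IDs than can possibly exist
  stopifnot(n <= 26^len)
  ids <- character(0)
  while (length(ids) < n) {
    candidate <- paste0(sample(LETTERS, len, replace = TRUE), collapse = "")
    # unique() silently drops a candidate that was already drawn
    ids <- unique(c(ids, candidate))
  }
  ids
}

set.seed(42)  # only to make the example reproducible
newIds <- make_random_ids(3)
anyDuplicated(newIds)  # 0 means every ID is distinct
```

The result can then be passed to `levels(df$id) <- newIds` as before.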

Created on 2020-03-02 by the reprex package (v0.3.0)

Upvotes: 4
