Reputation: 99
I need to anonymize data containing client numbers. About half of them are duplicates, because those clients appear more than once. How can I anonymize the data in R so that duplicates are always transformed into the same value?
Thanks in advance!
Upvotes: 3
Views: 696
Reputation: 173813
Suppose your data looks like this:
df <- data.frame(id = c("A", "B", "C", "A", "B", "C"), value = rnorm(6),
stringsAsFactors = FALSE)
df
#> id value
#> 1 A -0.8238857
#> 2 B -0.1553338
#> 3 C -0.6297834
#> 4 A -0.4616377
#> 5 B 0.1643057
#> 6 C -0.6719061
And your vector of new ID strings (which can be generated randomly; see the note at the end) looks like this:
newIds <- c("newId1", "newId2", "newId3")
Then you should first ensure that your id column is a factor:
df$id <- as.factor(df$id)
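To see why a factor is convenient here, consider the rough sketch below. It uses a small stand-in vector (`ids`, made up for illustration) rather than your real data. A factor stores each distinct value once as a level, and every row only holds an integer code pointing at its level, so renaming a level renames all of its duplicates at once.

```r
# Stand-in vector of client IDs with duplicates (hypothetical example data)
ids <- c("A", "B", "C", "A", "B", "C")
f <- as.factor(ids)

# The distinct values, stored once each
lev <- levels(f)

# Per-row integer codes that index into the levels
codes <- as.integer(f)

lev
codes
```

Rows that shared a client ID share a code, which is what guarantees that duplicates end up with the same new value.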
Then, before changing anything, you should probably store the mapping between the new and the original client IDs so that you can look them up safely later:
lookup <- data.frame(key = newIds, value = levels(df$id))
lookup
#> key value
#> 1 newId1 A
#> 2 newId2 B
#> 3 newId3 C
Now all you need to do is overwrite the factor levels:
levels(df$id) <- newIds
df
#> id value
#> 1 newId1 -0.8238857
#> 2 newId2 -0.1553338
#> 3 newId3 -0.6297834
#> 4 newId1 -0.4616377
#> 5 newId2 0.1643057
#> 6 newId3 -0.6719061
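If you later need to get back from the anonymized IDs to the original ones, the lookup table can be used with `match()`. The following is only a sketch: `anon` is a hypothetical stand-in for the anonymized data, and the lookup table is rebuilt here so that the snippet runs on its own.

```r
# Stand-in for the anonymized data (hypothetical example)
anon <- data.frame(
  id = c("newId1", "newId2", "newId3", "newId1", "newId2", "newId3"),
  stringsAsFactors = FALSE
)

# Same structure as the lookup table: key = new ID, value = original ID
lookup <- data.frame(
  key = c("newId1", "newId2", "newId3"),
  value = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# match() gives, for each anonymized ID, its row number in the lookup table
original <- lookup$value[match(anon$id, lookup$key)]
original
```

Since the lookup table reverses the anonymization, it should be stored separately from the anonymized data.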
Note: If you want to create random strings for the IDs (one 5-letter string per factor level), you can do this:
sapply(seq_along(levels(df$id)), function(x) paste0(sample(LETTERS, 5), collapse = ""))
#> [1] "TWABF" "YSBUF" "WVQEY"
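One caveat: the one-liner above does not check that the generated strings are all different, and with many clients two of them could receive the same random ID. A possible way around this, sketched here with a hypothetical helper `make_ids()` (my own name, not from any package), is to redraw until there are no duplicates:

```r
# Hypothetical helper: draw n random strings of `len` capital letters
# and redraw the whole set until all strings are distinct.
make_ids <- function(n, len = 5) {
  repeat {
    ids <- replicate(
      n,
      paste0(sample(LETTERS, len, replace = TRUE), collapse = "")
    )
    # anyDuplicated() returns 0 when there are no duplicates
    if (!anyDuplicated(ids)) return(ids)
  }
}

set.seed(1)  # only to make this sketch reproducible
newIds <- make_ids(3)
newIds
```

For a real data set you would call something like `make_ids(nlevels(df$id))`. A longer `len` makes collisions, and therefore redraws, less likely.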
Created on 2020-03-02 by the reprex package (v0.3.0)
Upvotes: 4