Reputation: 325
I have a data frame with a name in each row. I would like to replace the names with random strings or numbers, but a name that appears two or more times (e.g. Adam and Camille below) should get the same string every time.
df <- data.frame("name" = c("Adam", "Adam", "Billy", "Camille", "Camille", "Dennis"), "favourite food" = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"), stringsAsFactors = F)
The expected output is something like this (the appearance and length of the random string are not important):
df_exp <- data.frame("name" = c("xxyz", "xxyz", "xyyz", "xyzz", "xyzz", "yyzz"), "favourite food" = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"), stringsAsFactors = F)
I have tried several random replacement functions in R, but each of them creates a new random string for every row rather than one shared string per duplicated name, e.g. stri_rand_strings:
library(stringi)
library(magrittr)
library(tidyr)
library(dplyr)
df <- df %>%
mutate(UniqueID = do.call(paste0, Map(stri_rand_strings, n=6, length=c(2, 6),
pattern = c('[A-Z]', '[0-9]'))))
Upvotes: 1
Views: 1282
Reputation: 206232
One way is with group_by/mutate:
df %>%
group_by(name) %>%
mutate(hidden = stringi::stri_rand_strings(1, length=4)) %>%
ungroup() %>%
mutate(name=hidden)
Basically we just generate one random string per group.
You could also generate a translation table first with something like
new_names <- df %>%
distinct(name) %>%
mutate(new_name = stringi::stri_rand_strings(n(), length=c(2,6)))
and then merge that back onto the original data. Either way, I'm not sure that stri_rand_strings is guaranteed to return unique values; they're just random values. Collisions are unlikely, but it is easier to check that all the values are distinct if you create the translation table first.
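For what it's worth, here is a rough sketch of the translation-table route with the uniqueness check included. It assumes a simplified copy of the question's data (the second column is renamed `food` here), a fixed code length of 6, and a redraw loop for collisions; `df_anon` is just an illustrative name, so adjust all of these as needed.

```r
library(dplyr)
library(stringi)

# Simplified copy of the question's data (column name `food` is an assumption)
df <- data.frame(
  name = c("Adam", "Adam", "Billy", "Camille", "Camille", "Dennis"),
  food = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"),
  stringsAsFactors = FALSE
)

# Translation table: one row per distinct name, each with a random code
new_names <- df %>%
  distinct(name) %>%
  mutate(new_name = stri_rand_strings(n(), length = 6))

# Random codes are not guaranteed unique, so redraw if any collide
while (anyDuplicated(new_names$new_name) > 0) {
  new_names$new_name <- stri_rand_strings(nrow(new_names), length = 6)
}

# Join the codes onto the original rows and overwrite the real names
df_anon <- df %>%
  left_join(new_names, by = "name") %>%
  mutate(name = new_name) %>%
  select(-new_name)
```

Keeping `new_names` around also gives you a lookup to map the codes back to the real names later; drop it if the replacement is meant to be irreversible.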
Upvotes: 1