R: How to replace values in column with random numbers WITH duplicates

Question

I have a data frame with a name for each row. I would like to replace each name with a random string or number, using the same replacement wherever a name appears more than once (e.g. Adam and Camille below).

df <- data.frame("name" = c("Adam", "Adam", "Billy", "Camille", "Camille", "Dennis"), "favourite food" = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"), stringsAsFactors = FALSE)

The expected output is something like this (the exact form and length of the random string are not important):

df_exp <- data.frame("name" = c("xxyz", "xxyz", "xyyz", "xyzz", "xyzz", "yyzz"), "favourite food" = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"), stringsAsFactors = FALSE)

I have tried several random replacement functions in R, but each of them creates a new random string for every row instead of reusing one string for duplicated names. For example, with stri_rand_strings:


library(stringi)
library(magrittr)
library(tidyr)
library(dplyr)

df <- df %>%
    mutate(UniqueID = do.call(paste0, Map(stri_rand_strings, n=6, length=c(2, 6),
                                          pattern = c('[A-Z]', '[0-9]'))))

MrFlick · Accepted Answer

One way is with group_by() and mutate():

df %>% 
  group_by(name) %>% 
  mutate(hidden = stringi::stri_rand_strings(1, length=4)) %>% 
  ungroup() %>% 
  mutate(name = hidden) %>% 
  select(-hidden)   # drop the helper column so only the anonymised name remains

Basically we just generate one random string per group.
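If you want to convince yourself that every name really gets exactly one string, a quick check along these lines might help. This is only a sketch: the names `anon` and `check` are made up for illustration, and the column is written as `favourite.food` because that is what data.frame() turns "favourite food" into by default.

```r
library(dplyr)

df <- data.frame(
  name = c("Adam", "Adam", "Billy", "Camille", "Camille", "Dennis"),
  favourite.food = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"),
  stringsAsFactors = FALSE
)

# One random 4-character string per group; the single value is
# recycled to every row of that group.
anon <- df %>%
  group_by(name) %>%
  mutate(hidden = stringi::stri_rand_strings(1, length = 4)) %>%
  ungroup()

# Count how many different strings each original name received.
check <- anon %>%
  group_by(name) %>%
  summarise(n_codes = n_distinct(hidden))

# TRUE when every name maps to exactly one string.
all(check$n_codes == 1)
```

Note that this only confirms that duplicates share a string. It does not confirm that two different names got different strings.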

You could also generate a translation table first with something like

new_names <- df %>% 
  distinct(name) %>% 
  mutate(new_name = stringi::stri_rand_strings(n(), length=c(2,6)))

and then join that back to the original data. Either way, as far as I know stri_rand_strings is not guaranteed to return unique values; they are just random draws. Collisions are unlikely, but creating the translation table first makes it easier to check that all the values are distinct.
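A minimal sketch of the translation-table route, including the distinctness check, could look like the following. The fixed length of 6, the redraw-on-collision loop, and the name `df_anon` are my own assumptions for illustration, not something the question requires.

```r
library(dplyr)
library(stringi)

df <- data.frame(
  name = c("Adam", "Adam", "Billy", "Camille", "Camille", "Dennis"),
  favourite.food = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"),
  stringsAsFactors = FALSE
)

# Translation table: one row per distinct name, one random string each.
new_names <- df %>%
  distinct(name) %>%
  mutate(new_name = stri_rand_strings(n(), length = 6))

# Assumed retry policy: redraw all strings in the unlikely event
# that two names received the same one.
while (anyDuplicated(new_names$new_name) > 0) {
  new_names$new_name <- stri_rand_strings(nrow(new_names), length = 6)
}

# Join the strings back onto the original rows and drop the real names.
df_anon <- df %>%
  left_join(new_names, by = "name") %>%
  mutate(name = new_name) %>%
  select(-new_name)
```

Keeping `new_names` around (somewhere safe) also lets you map the random strings back to the real names later, which the group_by version does not.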
