Reputation: 175

In R: How to consistently replace (/ anonimize) ids or names within two separate data frames

Suppose I have two data frames, df1 and df2. Both data frames have an identifier id. My goal is to merge those data sets on this identifier, but I want to anonimize the names in the id column. However, problem is that I want to so for both data sets individually, thus for df1 and df2 and not directly on df3 (because that would be easy: just replace the id column with some random characters)

I think my solution would need to look something like this. First, I make a separate dataframe consisting of all unique ids from both df1 and df2. Then, I would need to assign some randomization, for example, idxxxx where xxxx is an unique number for each id in this separate data frame. With a dplyr, gsub, or stringr approach I can replace the ids from df1 and df2 according to the value assigned in this separate data frame. After this, I will merge the two data sets.

Here I have two example data frames, my try to solve the problem, and the desired result. Note that the the number of ids does not really matter to me (e.g., it does not matter if John Terry has id0004 or id0003, as long as it is consistently changed within both data frames.

Can someone help me out with this? Thanks!

id <- c("John Williams", "John Terry", "Rick Fire", "Katie Blue", "Unknown")
row1 <- c("28", "17", "17", "29", "39")
df1 <- data.frame(id,row1)

id <- c("Frank Johnson", "John Terry", "Rick Fire", "Katie Blue")
row2 <- c("Purple", "Red", "Yellow", "Green")
df2 <- data.frame(id,row2)

df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)

#My try
#Make separate data frame
id_df <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
id_df <- subset(id_df,TRUE,select = c(id))
id_df$anonymous <- id_df %>% mutate(id = row_number()) #it would be nicer to have something like id0001

#Replace ids within df1 and df2 according to the id_df anonymous variable
library(stringr)
df1$id <- str_replace(df1$id, id_df$id, as.character(id_df$anonymous)) #replacement does not work

#desired result
#df1        row1
#id0003     28 
#id0002     17
#id0005     17
#id0004     29
#id0006     39

#df2        row2
#id0001     Purple
#id0002     Red
#id0005     Yellow
#id0004     Green

#df3
#id         #row1       #row2
#id0001     NA          Purple
#id0002     17          Red
#id0003     28          NA
#id0004     29          Green
#id0005     17          Yellow
#id0006     39          NA

Upvotes: 1

Answers (3)

Chuck P

Reputation: 3923

A tidyverse solution that honors your request not to operate on df3 until you join

id_df <- data.frame(id = union(df1$id,df2$id))
id_df <- 
   id_df %>% 
   mutate(anonymous = paste0("id", stringr::str_pad(row_number(), 
                                                     width = 4, 
                                                     pad = 0)))

newdf1 <- left_join(df1, id_df) %>% select(-id) %>% relocate(anonymous)

newdf2 <- left_join(df2, id_df) %>% select(-id) %>% relocate(anonymous)

full_join(newdf1, newdf2)

#> Joining, by = "anonymous"
#>   anonymous row1   row2
#> 1    id0001   28   <NA>
#> 2    id0002   17    Red
#> 3    id0003   17 Yellow
#> 4    id0004   29  Green
#> 5    id0005   39   <NA>
#> 6    id0006 <NA> Purple

Upvotes: 1

qdread

Reputation: 3953

Here is a solution with all base R (no tidyverse). We create a lookup table with all unique IDs (use the set operation union to find the IDs) and then merge the lookup table with each data frame separately.

# Find all unique ids and create a lookup table.
all_ids <- union(df1$id, df2$id)
id_df <- data.frame(id = all_ids, code = paste0('id', sprintf('%04d', 1:length(all_ids))))

# Merge df1 with the lookup table, remove the id column, and rename the code column to id.
df1 <- merge(df1, id_df, all.x = TRUE)
df1 <- df1[, c('code', 'row1')]
names(df1)[1] <- 'id'

# Repeat for df2
df2 <- merge(df2, id_df, all.x = TRUE)
df2 <- df2[, c('code', 'row2')]
names(df2)[1] <- 'id'

df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)

Note that sprintf('%04d, ...)` will pad the number code with zeroes to a total length of 4.

Upvotes: 3

Allan Cameron

Reputation: 173858

If you want to anonymize the ids and make it reasonably difficult to reverse them, you could calculate the md5 hash of each string. Two identical strings will produce the same md5 hash:

df1$id <- sapply(df1$id, digest::digest, algo = "md5")
df2$id <- sapply(df2$id, digest::digest, algo = "md5")
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)

df3
#>                                 id row1   row2
#> 1 22e35044ed452870ad5b014e87121d9d   39   <NA>
#> 2 69d61b42a2f549c4765699f06de3b351   28   <NA>
#> 3 ad1cc76e26c5d73ba4a03bf51df1b6af   17 Yellow
#> 4 b3bdcc4913da319308e6ddf47e09da12 <NA> Purple
#> 5 badea53ae1e8a2fa66ebd1cdde9dd413   17    Red
#> 6 d1f305c19a2f9649abe11efcf26ac645   29  Green

Upvotes: 4

In R: How to consistently replace (/ anonimize) ids or names within two separate data frames

Answers (3)

Related Questions