Reputation: 175
Suppose I have two data frames, df1
and df2
. Both data frames have an identifier id
. My goal is to merge those data sets on this identifier, but I want to anonimize the names in the id
column. However, problem is that I want to so for both data sets individually, thus for df1
and df2
and not directly on df3
(because that would be easy: just replace the id
column with some random characters)
I think my solution would need to look something like this. First, I make a separate dataframe consisting of all unique id
s from both df1
and df2
. Then, I would need to assign some randomization, for example, idxxxx
where xxxx
is an unique number for each id
in this separate data frame. With a dplyr
, gsub
, or stringr
approach I can replace the id
s from df1
and df2
according to the value assigned in this separate data frame. After this, I will merge the two data sets.
Here I have two example data frames, my try to solve the problem, and the desired result. Note that the the number of ids does not really matter to me (e.g., it does not matter if John Terry has id0004 or id0003, as long as it is consistently changed within both data frames.
Can someone help me out with this? Thanks!
id <- c("John Williams", "John Terry", "Rick Fire", "Katie Blue", "Unknown")
row1 <- c("28", "17", "17", "29", "39")
df1 <- data.frame(id,row1)
id <- c("Frank Johnson", "John Terry", "Rick Fire", "Katie Blue")
row2 <- c("Purple", "Red", "Yellow", "Green")
df2 <- data.frame(id,row2)
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
#My try
#Make separate data frame
id_df <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
id_df <- subset(id_df,TRUE,select = c(id))
id_df$anonymous <- id_df %>% mutate(id = row_number()) #it would be nicer to have something like id0001
#Replace ids within df1 and df2 according to the id_df anonymous variable
library(stringr)
df1$id <- str_replace(df1$id, id_df$id, as.character(id_df$anonymous)) #replacement does not work
#desired result
#df1 row1
#id0003 28
#id0002 17
#id0005 17
#id0004 29
#id0006 39
#df2 row2
#id0001 Purple
#id0002 Red
#id0005 Yellow
#id0004 Green
#df3
#id #row1 #row2
#id0001 NA Purple
#id0002 17 Red
#id0003 28 NA
#id0004 29 Green
#id0005 17 Yellow
#id0006 39 NA
Upvotes: 1
Views: 1286
Reputation: 3923
A tidyverse solution that honors your request not to operate on df3
until you join
id_df <- data.frame(id = union(df1$id,df2$id))
id_df <-
id_df %>%
mutate(anonymous = paste0("id", stringr::str_pad(row_number(),
width = 4,
pad = 0)))
newdf1 <- left_join(df1, id_df) %>% select(-id) %>% relocate(anonymous)
newdf2 <- left_join(df2, id_df) %>% select(-id) %>% relocate(anonymous)
full_join(newdf1, newdf2)
#> Joining, by = "anonymous"
#> anonymous row1 row2
#> 1 id0001 28 <NA>
#> 2 id0002 17 Red
#> 3 id0003 17 Yellow
#> 4 id0004 29 Green
#> 5 id0005 39 <NA>
#> 6 id0006 <NA> Purple
Upvotes: 1
Reputation: 3953
Here is a solution with all base R (no tidyverse). We create a lookup table with all unique IDs (use the set operation union
to find the IDs) and then merge
the lookup table with each data frame separately.
# Find all unique ids and create a lookup table.
all_ids <- union(df1$id, df2$id)
id_df <- data.frame(id = all_ids, code = paste0('id', sprintf('%04d', 1:length(all_ids))))
# Merge df1 with the lookup table, remove the id column, and rename the code column to id.
df1 <- merge(df1, id_df, all.x = TRUE)
df1 <- df1[, c('code', 'row1')]
names(df1)[1] <- 'id'
# Repeat for df2
df2 <- merge(df2, id_df, all.x = TRUE)
df2 <- df2[, c('code', 'row2')]
names(df2)[1] <- 'id'
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
Note that sprintf('%04d
, ...)` will pad the number code with zeroes to a total length of 4.
Upvotes: 3
Reputation: 173858
If you want to anonymize the ids and make it reasonably difficult to reverse them, you could calculate the md5 hash of each string. Two identical strings will produce the same md5 hash:
df1$id <- sapply(df1$id, digest::digest, algo = "md5")
df2$id <- sapply(df2$id, digest::digest, algo = "md5")
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
df3
#> id row1 row2
#> 1 22e35044ed452870ad5b014e87121d9d 39 <NA>
#> 2 69d61b42a2f549c4765699f06de3b351 28 <NA>
#> 3 ad1cc76e26c5d73ba4a03bf51df1b6af 17 Yellow
#> 4 b3bdcc4913da319308e6ddf47e09da12 <NA> Purple
#> 5 badea53ae1e8a2fa66ebd1cdde9dd413 17 Red
#> 6 d1f305c19a2f9649abe11efcf26ac645 29 Green
Upvotes: 4