Reputation: 311
I'd like to do what I think is a very simple operation: add a column that gives each person a number, in a dataset containing (potentially) duplicate names. I think I am close. The code below takes a dataset of names, does pairwise comparisons, and appends a column indicating whether each pair is a likely match. Now I want to go one step further: instead of dropping duplicates, I want to assign a unique identifier to each person.
Example:
Peter
Peter
Peter
Connor
Matt
would become
Peter -- 1
Peter -- 1
Peter -- 1
Connor -- 2
Matt -- 3
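library(RecordLinkage)
data(RLdata10000)
# Compare only record pairs that agree on column 5 (blocking)
rpairs <- compare.dedup(RLdata10000, blockfld = 5)
# Compute a weight for each pair, then classify pairs at a 0.7 threshold
p <- epiWeights(rpairs)
classify <- epiClassify(p, 0.7)
summary(classify)
# Attach the predicted link status to the compared pairs
match <- classify$prediction
results <- cbind(classify$pairs, match)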
Upvotes: 2
Views: 716
Reputation: 51
A small rewrite that keeps the ID column out of the comparison, so the weights and the classification threshold do not have to be tuned around the IDs:
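library(RecordLinkage)
library(dplyr)
df_names <- data.frame(Name = c("Peter", "Peter", "Peter", "Connor", "Matt"))
# Compare, weight, classify, and keep only the pairs classified as links
df_names %>%
  compare.dedup() %>%
  epiWeights() %>%
  epiClassify(0.3) %>%
  getPairs(show = "links", single.rows = TRUE) -> matches
# Give each row its own ID, then replace it with the ID of the lowest-numbered row it is linked to
left_join(mutate(df_names, ID = 1:nrow(df_names)),
          select(matches, id1, id2) %>% arrange(id1) %>% filter(!duplicated(id2)),
          by = c("ID" = "id2")) %>%
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1)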
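To make the join step easier to follow, here is a minimal sketch of the same logic in base R rather than dplyr. The `matches` table is written by hand as a stand-in for what `getPairs()` might return, so its values are an assumption for illustration, not real output from RecordLinkage.

```r
# Toy data from the question
df_names <- data.frame(Name = c("Peter", "Peter", "Peter", "Connor", "Matt"))

# Hypothetical link table (hand-written, NOT actual getPairs() output):
# each row says "rows id1 and id2 were classified as the same person"
matches <- data.frame(id1 = c(1, 1, 2), id2 = c(2, 3, 3))

# For every id2, keep only the smallest id1 it is linked to
matches <- matches[order(matches$id1), ]
matches <- matches[!duplicated(matches$id2), ]

# Start with one ID per row, then overwrite it for rows that have a link
df_names$ID <- seq_len(nrow(df_names))
linked <- match(df_names$ID, matches$id2)  # NA where a row never appears as id2
df_names$ID <- ifelse(is.na(linked), df_names$ID, matches$id1[linked])

print(df_names)
```

One caveat with this approach: it only follows direct links. If row 1 is linked to row 2 and row 2 to row 3, but rows 1 and 3 are not linked to each other, row 3 gets ID 2 while row 2 gets ID 1, so the three rows do not end up sharing one ID.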
Upvotes: 5
Reputation: 311
I figured out the answer to my own question.
df_names <- df_names %>% mutate(ID = 1:nrow(df_names))
rpairs <- compare.dedup(df_names)
p <- epiWeights(rpairs)
classify <- epiClassify(p, 0.83)
summary(classify)
matches <- getPairs(classify, show = "links", single.rows = TRUE)
# This code writes an "ID" column that is the same for similar names
matches <- matches %>% arrange(ID.1) %>% filter(!duplicated(ID.2))
df_names$ID_prior <- df_names$ID
# Merge matching information with the original data
df_names <- left_join(df_names, matches %>% select(ID.1, ID.2), by = c("ID" = "ID.2"))
# Replace matches in ID with the row they match with from ID.1
df_names$ID <- ifelse(is.na(df_names$ID.1), df_names$ID, df_names$ID.1)
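The IDs produced this way are row numbers, so for the example in the question they would come out as something like 1, 1, 1, 4, 5 rather than 1, 1, 1, 2, 3. If consecutive numbers are wanted, one possible follow-up step is to renumber them by order of first appearance. The sketch below uses a hand-written `ids` vector as a stand-in for `df_names$ID`; the vector and the variable names are assumptions for illustration.

```r
# Hypothetical IDs, standing in for df_names$ID after the matching step
ids <- c(1, 1, 1, 4, 5)

# match() returns, for each ID, its position among the distinct IDs,
# which renumbers the groups consecutively in order of first appearance
consecutive <- match(ids, unique(ids))

print(consecutive)
```

In the real data frame this would be `df_names$ID <- match(df_names$ID, unique(df_names$ID))`.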
Upvotes: 2