Reputation: 21
I have a database with three columns: name
, occupation
, and organization
. In these columns, I have duplicates with slightly different names. For example, Anne Sue Frank
and Anne S. Frank
refer to the same person, as they have the same occupation and belong to the same organization.
Is there a way for me to create another table that maps these names to their corresponding matches? I tried using stringdist, but it mostly matches names with themselves (e.g., Anne Sue Frank
with Anne Sue Frank
that are the same row), which I don't want. My objetive is finding the duplicate names, so I would need a new table showing the name in a column with their correspondet in the second column.
Example:
df1 <- data.frame(
name = c("Anne Sue Frank", "John S. Gooble", "Anne S. Frank", "Johnatan Sue Google"),
organization = c("ABC", "FCV", "ABC", "FCV"),
occupation = c("director", "teacher", "director", "teacher"),
stringsAsFactors = FALSE
)
df1
name organization occupation
1 Anne Sue Frank ABC director
2 John S. Gooble FCV teacher
3 Anne S. Frank ABC director
4 Johnatan Sue Google FCV teacher
Upvotes: 1
Views: 53
Reputation: 79246
Here is a solution with stringdist
First with inner_join
we make a selfjoin, then we filter out self-pairs and duplicates with filter(name1 < name2)
.
After calculating the similarity score using the Jaro-Winkler method "jw"
we apply the similarity.
Thanks to @Rui Barradas please consider method = "soundex"
. See in comments!
library(dplyr)
library(stringdist)
df1 %>%
inner_join(df1, join_by(organization, occupation), suffix = c("1", "2"), relationship = "many-to-many") %>%
filter(name1 < name2) %>%
mutate(similarity = 1 - stringdist(name1, name2, method = "jw")) %>%
filter(similarity >= 0.75) %>% # define and tweak similarity_threshold
relocate(organization, occupation, name1, name2, similarity) |>
as_tibble()
organization occupation name1 name2 similarity
<chr> <chr> <chr> <chr> <dbl>
1 FCV teacher John S. Gooble Johnatan Sue Google 0.760
2 ABC director Anne S. Frank Anne Sue Frank 0.927
The challenge lies in determining the appropriate similarity_threshold, which in this case is set to 0.75
.
Upvotes: 3