Vitoria Sanchez
Vitoria Sanchez

Reputation: 21

Match similar names

I have a database with three columns: name, occupation, and organization. In these columns, I have duplicates with slightly different names. For example, Anne Sue Frank and Anne S. Frank refer to the same person, as they have the same occupation and belong to the same organization.

Is there a way for me to create another table that maps these names to their corresponding matches? I tried using stringdist, but it mostly matches names with themselves (e.g., Anne Sue Frank with Anne Sue Frank that are the same row), which I don't want. My objetive is finding the duplicate names, so I would need a new table showing the name in a column with their correspondet in the second column.

Example:

df1 <- data.frame(
  name = c("Anne Sue Frank", "John S. Gooble", "Anne S. Frank", "Johnatan Sue Google"),
  organization = c("ABC", "FCV", "ABC", "FCV"), 
  occupation = c("director", "teacher", "director", "teacher"),
  stringsAsFactors = FALSE
)

df1

name organization occupation
1      Anne Sue Frank          ABC   director
2      John S. Gooble          FCV    teacher
3       Anne S. Frank          ABC   director
4 Johnatan Sue Google          FCV    teacher

Upvotes: 1

Views: 53

Answers (1)

TarJae
TarJae

Reputation: 79246

Here is a solution with stringdist

First with inner_join we make a selfjoin, then we filter out self-pairs and duplicates with filter(name1 < name2). After calculating the similarity score using the Jaro-Winkler method "jw" we apply the similarity.

Thanks to @Rui Barradas please consider method = "soundex". See in comments!

library(dplyr)
library(stringdist)

df1 %>%
  inner_join(df1, join_by(organization, occupation), suffix = c("1", "2"), relationship = "many-to-many") %>%
  filter(name1 < name2) %>%
  mutate(similarity = 1 - stringdist(name1, name2, method = "jw")) %>%
  filter(similarity >= 0.75) %>% # define and tweak similarity_threshold
  relocate(organization, occupation, name1, name2, similarity) |> 
  as_tibble()

  organization occupation name1          name2               similarity
  <chr>        <chr>      <chr>          <chr>                    <dbl>
1 FCV          teacher    John S. Gooble Johnatan Sue Google      0.760
2 ABC          director   Anne S. Frank  Anne Sue Frank           0.927

The challenge lies in determining the appropriate similarity_threshold, which in this case is set to 0.75.

Upvotes: 3

Related Questions