t_97
t_97

Reputation: 43

How to compare two character vectors for near equality in R?

I have two datasets, and I want to compare the values of one column in one dataset with values of a column in another dataset. Given this is the column of the first dataset:

df1$names
KLERK
JAMES MEDICAL CITY
MISSION & COMMUNITY CENTER
COSTA MEDICAL CENTER
...

Given the column in the other dataset is:

df2$names
Klerk Hospital
James Medical
Mission and Community
Costa Medical
...

As you can see, the values are not completely equal, but for example, KLERK and Klerk Hospital are the same hospital. If I use basic R comparisons, it only gives me FALSE. I also tried to use the stri_cmp_equiv function with no success. I only have names and addresses that I can use to match, but with the address, it is the same problem. For example 1200 E Road and 1200 East Road. I am looking for a function or solution to compare the character values for if they are close but not completely equal. The outcome should be TRUE/FALSE. Thanks for any help!

Upvotes: 1

Views: 1146

Answers (2)

missuse
missuse

Reputation: 19716

One approach would be to compute a suitable distance metric between strings, and based on a certain similarity/disimilarity threshold assign TRUE or FALSE. In R the package stringdist is created for this sort of task. Before comparison I would convert all characters to lower.

library(stringdist)
stringdist(tolower(vec1), tolower(vec2))
#output
9  5 10  7

you can change the metric to the one that suits your need the best:

stringdist(tolower(vec1), tolower(vec2), method = "cosine")
#output
0.28730335 0.07852140 0.09548458 0.09507710

The function stringsim returns similarity:

stringsim(tolower(vec1), tolower(vec2))
#output
0.3571429 0.7222222 0.6153846 0.6500000

Example:

df <- data.frame(v1 = vec1, v2 = vec2)

library(tidyverse)

df %>%
  mutate(cosine = stringdist(tolower(df$v1), tolower(df$v2), method = "cosine"),
         decision = cosine < 0.2)
#output
                          v1                    v2     cosine decision
1                      KLERK        Klerk Hospital 0.28730335    FALSE
2         JAMES MEDICAL CITY         James Medical 0.07852140     TRUE
3 MISSION & COMMUNITY CENTER Mission and Community 0.09548458     TRUE
4       COSTA MEDICAL CENTER         Costa Medical 0.09507710     TRUE

There are several different methods, in addition with methods osa or dl you can set the penalty for deletion, insertion, substitution and transposition, in that order. Example (low penalty for deletion and insertion but high for substitution and transposition):

df %>%
  mutate(diff = stringdist(tolower(df$v1), tolower(df$v2),
                           method = "osa",
                           weight = c(0.01, 0.01, 1, 1)),
         decision = diff < 0.2)

#output
                          v1                    v2 diff decision
1                      KLERK        Klerk Hospital 0.09     TRUE
2         JAMES MEDICAL CITY         James Medical 0.05     TRUE
3 MISSION & COMMUNITY CENTER Mission and Community 0.11     TRUE
4       COSTA MEDICAL CENTER         Costa Medical 0.07     TRUE

EDIT: as suggested by @danlooo in the comments another option would be to make a local string alignment and output its score. Based on the score assign a threshold to convert to logical. This can be performed with the Biostrings package available via Bioconductor (see install section in the provided link if you decide to go this raute):

library(Biostrings)
apply(df, 1, function(x){
  pairwiseAlignment(tolower(x[1]), tolower(x[2]), type = "overlap")@score
})
#output
17.83352 46.36715 45.78177 47.86837

to return the whole pairwiseAlignment objects:

apply(df, 1, function(x){
  pairwiseAlignment(tolower(x[1]), tolower(x[2]), type = "overlap")
})
#output
[[1]]
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] klerk
subject: [1] klerk
score: 17.83352 

[[2]]
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] james medical
subject: [1] james medical
score: 46.36715 

[[3]]
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] mission &-- community
subject: [1] mission and community
score: 45.78177 

[[4]]
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] costa medical
subject: [1] costa medical
score: 47.86837

Just bare in mind this last code run returns S4 type of object.

data:

vec1 <- c("KLERK",
          "JAMES MEDICAL CITY",
          "MISSION & COMMUNITY CENTER",
          "COSTA MEDICAL CENTER")

vec2 <- c("Klerk Hospital",
          "James Medical",
          "Mission and Community",
          "Costa Medical")

Upvotes: 4

Hansel Palencia
Hansel Palencia

Reputation: 1046

here is a solution to you problem.

Basically we can use regex to determine how closely matched each string is to it's companion string. If over 50% of the individual words match then it returns true.

names1 <- data.frame(Name = c("KLERK", "JAMES MEDICAL CITY", "MISSION & COMMUNITY CENTER", "COSTA MEDICAL CENTER"))
names2 <- data.frame(Name = c("Klerk Hospital", "James Medical", "Mission and Community", "Cost Medical"))

names1$Name <- str_to_lower(names1$Name)
names2$Name <- str_to_lower(names2$Name)

split_names1 <- str_split(names1$Name, " ")
split_names2 <- str_split(names2$Name, " ")


str_detect(split_names1[[1]], split_names2[[1]])


my_matches <- mapply(str_detect, split_names1, split_names2)


my_func <- function(x) {
  
  ifelse(sum(x)/length(x) >= .5, T, F)
  
}

map(my_matches, my_func)


[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

[[4]]
[1] TRUE


Upvotes: 0

Related Questions