Reputation: 43
I have two datasets, and I want to compare the values of one column in one dataset with values of a column in another dataset. Given this is the column of the first dataset:
df1$names
KLERK
JAMES MEDICAL CITY
MISSION & COMMUNITY CENTER
COSTA MEDICAL CENTER
...
Given the column in the other dataset is:
df2$names
Klerk Hospital
James Medical
Mission and Community
Costa Medical
...
As you can see, the values are not completely equal, but for example, KLERK
and Klerk Hospital
are the same hospital. If I use basic R comparisons, it only gives me FALSE. I also tried to use the stri_cmp_equiv function with no success. I only have names and addresses that I can use to match, but with the address, it is the same problem. For example 1200 E Road
and 1200 East Road
. I am looking for a function or solution to compare the character values for if they are close but not completely equal. The outcome should be TRUE
/FALSE
. Thanks for any help!
Upvotes: 1
Views: 1146
Reputation: 19716
One approach would be to compute a suitable distance metric between strings, and based on a certain similarity/disimilarity threshold assign TRUE or FALSE. In R the package stringdist
is created for this sort of task. Before comparison I would convert all characters to lower.
library(stringdist)
stringdist(tolower(vec1), tolower(vec2))
#output
9 5 10 7
you can change the metric to the one that suits your need the best:
stringdist(tolower(vec1), tolower(vec2), method = "cosine")
#output
0.28730335 0.07852140 0.09548458 0.09507710
The function stringsim
returns similarity:
stringsim(tolower(vec1), tolower(vec2))
#output
0.3571429 0.7222222 0.6153846 0.6500000
Example:
df <- data.frame(v1 = vec1, v2 = vec2)
library(tidyverse)
df %>%
mutate(cosine = stringdist(tolower(df$v1), tolower(df$v2), method = "cosine"),
decision = cosine < 0.2)
#output
v1 v2 cosine decision
1 KLERK Klerk Hospital 0.28730335 FALSE
2 JAMES MEDICAL CITY James Medical 0.07852140 TRUE
3 MISSION & COMMUNITY CENTER Mission and Community 0.09548458 TRUE
4 COSTA MEDICAL CENTER Costa Medical 0.09507710 TRUE
There are several different methods, in addition with methods osa
or dl
you can set the penalty for deletion, insertion, substitution and transposition, in that order. Example (low penalty for deletion and insertion but high for substitution and transposition):
df %>%
mutate(diff = stringdist(tolower(df$v1), tolower(df$v2),
method = "osa",
weight = c(0.01, 0.01, 1, 1)),
decision = diff < 0.2)
#output
v1 v2 diff decision
1 KLERK Klerk Hospital 0.09 TRUE
2 JAMES MEDICAL CITY James Medical 0.05 TRUE
3 MISSION & COMMUNITY CENTER Mission and Community 0.11 TRUE
4 COSTA MEDICAL CENTER Costa Medical 0.07 TRUE
EDIT: as suggested by @danlooo in the comments another option would be
to make a local string alignment and output its score. Based on the score assign a threshold to convert to logical. This can be performed with the Biostrings
package available via Bioconductor (see install section in the provided link if you decide to go this raute):
library(Biostrings)
apply(df, 1, function(x){
pairwiseAlignment(tolower(x[1]), tolower(x[2]), type = "overlap")@score
})
#output
17.83352 46.36715 45.78177 47.86837
to return the whole pairwiseAlignment
objects:
apply(df, 1, function(x){
pairwiseAlignment(tolower(x[1]), tolower(x[2]), type = "overlap")
})
#output
[[1]]
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] klerk
subject: [1] klerk
score: 17.83352
[[2]]
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] james medical
subject: [1] james medical
score: 46.36715
[[3]]
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] mission &-- community
subject: [1] mission and community
score: 45.78177
[[4]]
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] costa medical
subject: [1] costa medical
score: 47.86837
Just bare in mind this last code run returns S4 type of object.
data:
vec1 <- c("KLERK",
"JAMES MEDICAL CITY",
"MISSION & COMMUNITY CENTER",
"COSTA MEDICAL CENTER")
vec2 <- c("Klerk Hospital",
"James Medical",
"Mission and Community",
"Costa Medical")
Upvotes: 4
Reputation: 1046
here is a solution to you problem.
Basically we can use regex to determine how closely matched each string is to it's companion string. If over 50% of the individual words match then it returns true.
names1 <- data.frame(Name = c("KLERK", "JAMES MEDICAL CITY", "MISSION & COMMUNITY CENTER", "COSTA MEDICAL CENTER"))
names2 <- data.frame(Name = c("Klerk Hospital", "James Medical", "Mission and Community", "Cost Medical"))
names1$Name <- str_to_lower(names1$Name)
names2$Name <- str_to_lower(names2$Name)
split_names1 <- str_split(names1$Name, " ")
split_names2 <- str_split(names2$Name, " ")
str_detect(split_names1[[1]], split_names2[[1]])
my_matches <- mapply(str_detect, split_names1, split_names2)
my_func <- function(x) {
ifelse(sum(x)/length(x) >= .5, T, F)
}
map(my_matches, my_func)
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] TRUE
[[4]]
[1] TRUE
Upvotes: 0