kate88

Reputation: 371

Matching strings using R

I have a general question. I am trying to do string matching between data frames in R. My strings have the format below:

"COOL FOODS LTD 222 HIGH ST LONDON ABC123"  

I would like to iterate over other data frames and have my code find matches between the string above and strings such as these:

"222 HIGH ST LONDON ABC123 COOL FOODS LTD " 
"HIGH LTD ST 222 LONDON COOL ABC123 FOODS "
"COOL FOODS LTD 222 HIGH ST LONDON UNITED KINGDOM ABC123"

I tried adist, but the similarity scores it gives are poor when parts of the string are rearranged or when a long piece of text is inserted (as in the examples above).

I thought about splitting my strings by white spaces, but I'm not sure how to then do the matching and comparing efficiently with many data frames.

I would be grateful for any suggestions!

Cheers!

Upvotes: 1

Views: 504

Answers (1)

Rui Barradas

Reputation: 76651

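Using the stringdist package, you can write a helper function that compares a string to each target string in a vector.
The function below first splits every string on spaces (strsplit) and sorts the resulting words, so that word order no longer affects the comparison. It then calls stringsim to compute a similarity score between 0 and 1.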

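library(stringdist)

# Strings from the question.
x <- "COOL FOODS LTD 222 HIGH ST LONDON ABC123"
y <- c("222 HIGH ST LONDON ABC123 COOL FOODS LTD ",
       "HIGH LTD ST 222 LONDON COOL ABC123 FOODS ",
       "COOL FOODS LTD 222 HIGH ST LONDON UNITED KINGDOM ABC123")

funSimilarity <- function(x, y, method = "osa"){
    # Split on spaces, sort the words and paste them back together.
    x <- strsplit(x, " ")[[1]]
    x <- paste(sort(x), collapse = " ")
    y_list <- strsplit(y, " ")
    y_list <- lapply(y_list, function(.y) paste(sort(.y), collapse = " "))
    stringsim(x, unlist(y_list), method = method)
}

funSimilarity(x, y)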
#[1] 1.0000000 1.0000000 0.7272727
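
To run the same comparison between data frames, as the question asks, something like the following might work. It is only a sketch: the data frames df1 and df2, the column name address, the extra fourth row, the helper sortTokens and the 0.7 cut-off are all made up for illustration.

```r
library(stringdist)

# Same normalisation as above: split on spaces, sort the words, re-join.
# sortTokens is a hypothetical helper name.
sortTokens <- function(s) {
  vapply(strsplit(s, " "),
         function(tok) paste(sort(tok), collapse = " "),
         character(1))
}

# Hypothetical data frames; the column name `address` is an assumption.
df1 <- data.frame(address = "COOL FOODS LTD 222 HIGH ST LONDON ABC123",
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  address = c("222 HIGH ST LONDON ABC123 COOL FOODS LTD ",
              "HIGH LTD ST 222 LONDON COOL ABC123 FOODS ",
              "COOL FOODS LTD 222 HIGH ST LONDON UNITED KINGDOM ABC123",
              "WARM DRINKS PLC 9 LOW RD LEEDS XYZ789"),
  stringsAsFactors = FALSE)

# One row per string in df1, one column per string in df2.
sim <- outer(sortTokens(df1$address), sortTokens(df2$address),
             stringsim, method = "osa")

# Pairs at or above an arbitrary cut-off; tune 0.7 for your own data.
matches <- which(sim >= 0.7, arr.ind = TRUE)

# Best candidate in df2 for the first row of df1.
best <- df2$address[which.max(sim[1, ])]
```

Sorting each string once up front avoids repeating the split-and-sort step for every pair. With many large data frames the full similarity matrix can get big, so it may be worth comparing in chunks.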

met <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex")

sapply(met, function(m) funSimilarity(x, y, method = m))
#           osa        lv        dl hamming       lcs     qgram    cosine
#[1,] 1.0000000 1.0000000 1.0000000       1 1.0000000 1.0000000 1.0000000
#[2,] 1.0000000 1.0000000 1.0000000       1 1.0000000 1.0000000 1.0000000
#[3,] 0.7272727 0.7272727 0.7272727       0 0.8421053 0.8421053 0.9689541
#       jaccard        jw soundex
#[1,] 1.0000000 1.0000000       1
#[2,] 1.0000000 1.0000000       1
#[3,] 0.8095238 0.8632576       1
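
The jaccard method above works on character q-grams. Since the question mentions splitting the strings by white space, another option is to compute a Jaccard score on whole words instead, which ignores word order completely and needs only base R. This is a sketch rather than part of stringdist; the function name tokenJaccard is made up here.

```r
# Word-level Jaccard similarity: shared words / all distinct words.
# Compares one string x against every string in the vector y.
tokenJaccard <- function(x, y) {
  # Trim surrounding white space, then split on runs of spaces.
  xt <- unique(strsplit(trimws(x), " +")[[1]])
  vapply(strsplit(trimws(y), " +"), function(yt) {
    yt <- unique(yt)
    length(intersect(xt, yt)) / length(union(xt, yt))
  }, numeric(1))
}

x <- "COOL FOODS LTD 222 HIGH ST LONDON ABC123"
y <- c("222 HIGH ST LONDON ABC123 COOL FOODS LTD ",
       "HIGH LTD ST 222 LONDON COOL ABC123 FOODS ",
       "COOL FOODS LTD 222 HIGH ST LONDON UNITED KINGDOM ABC123")

tokenJaccard(x, y)
#[1] 1.0 1.0 0.8
```

The third string scores 0.8 because it shares all 8 words of x and adds 2 new ones (UNITED, KINGDOM), giving 8/10. Note that this treats words as exact tokens, so a typo inside a word counts as a completely different word.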

Upvotes: 1
