Feng Chen

Reputation: 2253

Fuzzy matching two strings using R

I have two vectors, each of which includes a series of strings. For example,

V1=c("pen", "document folder", "warn")
V2=c("pens", "copy folder", "warning")

I need to find which pairs match best. I tried Levenshtein distance directly, but it is not good enough. In my case, pen and pens should mean the same thing, document folder and copy folder are probably the same thing, and warn and warning are effectively the same. I am trying to use packages like tm, but I am not sure which functions are suitable for this. Can anyone point me in the right direction?

Upvotes: 2

Views: 1109

Answers (2)

statespace

Reputation: 1664

See the Wikipedia article on Levenshtein distance. It counts how many insert/delete/substitute operations are needed to transform one string into another. One approach to fuzzy matching is to pick the candidate that minimizes this value.
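
To make the definition concrete, here is a minimal sketch using base R's adist() on a few of the strings from the question. The variable names are only for illustration:

```r
# adist() is in base R (package utils); it returns a matrix of edit distances.
d_pen  <- adist("pen", "pens")      # one insertion: "s"
d_warn <- adist("warn", "warning")  # three insertions: "i", "n", "g"
d_same <- adist("warn", "warn")     # identical strings need no edits

c(pen_pens = d_pen[1, 1], warn_warning = d_warn[1, 1], warn_warn = d_same[1, 1])
```

Note that the raw count grows with string length, so "warn" vs "warning" looks further apart than "pen" vs "pens" even though both pairs are equally plausible matches.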

Here's an example with your vectors. I shuffled the order of V2 a bit to make it less boring:

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

apply(adist(x = V1, y = V2), 1, which.min)
[1] 3 1 2

The output gives, for each element of V1 (in order), the position in V2 of the closest string.

data.frame(string_to_match = V1, 
           closest_match = V2[apply(adist(x = V1, y = V2), 1, which.min)])
  string_to_match closest_match
1             pen          pens
2 document folder   copy folder
3            warn       warning
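
Since the question says raw Levenshtein distance was not good enough, one possible tweak (not part of the approach above, just a sketch) is to divide each distance by the length of the longer string, so that short and long strings are compared on the same 0 to 1 scale. The names max_len and norm_d are made up for this example:

```r
V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

# Raw edit distances: rows follow V1, columns follow V2.
d <- adist(V1, V2)

# Length of the longer string for every (V1, V2) pair.
max_len <- outer(nchar(V1), nchar(V2), pmax)

# Normalised distance: 0 means identical, 1 means nothing in common.
norm_d <- d / max_len
dimnames(norm_d) <- list(V1, V2)

round(norm_d, 2)

# Best match per element of V1, as before.
best <- apply(norm_d, 1, which.min)
data.frame(string_to_match = V1, closest_match = V2[best])
```

On these vectors the ranking does not change, but the normalised values are easier to compare against a single threshold.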

Upvotes: 2

Tobias Dekker

Reputation: 1030

In my experience, cosine distance works well for this kind of job:

library(stringdist)  # provides stringdist()

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")
result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 1))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 0.6797437       0.2132042 0.8613250
warning     0.6150998       0.7817821 0.1666667
pens        0.1339746       0.6726732 0.7500000

You have to define a cut-off below which the distance counts as close enough; the lower the distance, the better the match. You can also play with the q parameter, which sets the length of the character sequences (q-grams) that are compared. For example:

result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 3))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 1.0000000       0.5377498 1.0000000
warning     1.0000000       1.0000000 0.3675445
pens        0.2928932       1.0000000 1.0000000
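
The cut-off idea could be sketched roughly like this. The value 0.25 is an arbitrary assumption picked for this toy data, and the names cutoff, best_idx, best_dist and matches are only illustrative; you would need to tune the threshold on your own strings:

```r
library(stringdist)

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

# Rows follow V2, columns follow V1, as in the tables above.
result <- sapply(V1, function(x) stringdist(x, V2, method = "cosine", q = 1))
rownames(result) <- V2

cutoff <- 0.25  # assumed threshold, not a recommended default

# For each column (element of V1): position and value of the smallest distance.
best_idx  <- unname(apply(result, 2, which.min))
best_dist <- unname(apply(result, 2, min))

# Keep the closest candidate only if it is within the cut-off, otherwise NA.
matches <- data.frame(
  string_to_match = V1,
  closest_match   = ifelse(best_dist <= cutoff, V2[best_idx], NA_character_),
  distance        = best_dist
)
matches
```

Strings whose best candidate is still further away than the cut-off come back as NA, which makes it easy to review the unmatched ones by hand.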

Upvotes: 3
