JoshDG

Reputation: 3931

Imperfect String Matching

Say I have two columns of names. All names in the first column are in the second column, but in a random order, and some of them are not perfect matches. For example, one column might contain the name John Smith and the other John_smith or JonSmith. Is there a fairly simple way in R to perform a "best match"?

Upvotes: 5

Views: 1891

Answers (1)

Justin

Reputation: 43255

Given some data like this:

df <- data.frame(x = c('john doe', 'john smith', 'sally struthers'),
                 y = c('John Smith', 'John_smith', 'JonSmith'))

You can get a long way with a few gsubs and tolower:

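# Replace punctuation with spaces, then strip all spaces and lowercase
df$y.fix <- gsub('[[:punct:]]', ' ', df$y)
df$y.fix <- gsub(' ', '', df$y.fix)
df$y.fix <- tolower(df$y.fix)
# Same normalization for x, which has no punctuation in this example
df$x.fix <- tolower(gsub(' ', '', df$x))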

Then agrep is what you'll want:

> agrep(df$x.fix[2], df$y.fix)
[1] 1 2 3
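
agrep returns every candidate within its distance threshold, so the call above matches all three rows. If you want a single best match per name, one option is to compute edit distances with adist (from base R's utils package) and take the closest candidate. This is only a sketch: the normalize helper and the column names in result are made up for illustration, and ties simply go to the first candidate.

```r
x <- c("john doe", "john smith", "sally struthers")
y <- c("John Smith", "John_smith", "JonSmith")

# Hypothetical helper: drop punctuation and whitespace, then lowercase
normalize <- function(s) tolower(gsub("[[:punct:][:space:]]", "", s))

x.fix <- normalize(x)
y.fix <- normalize(y)

# Edit-distance matrix: rows are x, columns are y
d <- adist(x.fix, y.fix)

# Index of the closest y for each x (first one wins on ties)
best <- apply(d, 1, which.min)

result <- data.frame(
  x        = x,
  best.y   = y[best],
  distance = d[cbind(seq_along(x), best)]
)
result
```

The distance column is worth checking afterwards: a name with no real counterpart (like 'john doe' here) still gets assigned its nearest candidate, just with a large distance.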

For more complex or confusing strings, see this post from last week.

Upvotes: 10
