polarbear

Reputation: 13

What is the best method for fuzzy matching all elements of a single vector or column against all the elements within that same vector or column?

For example, if I had a data.frame such as

df <- data.frame(Name = c('Chris','Christopher','John','Jon','Jonathan'))

Is there a way for me to build a similarity matrix comparing how similar each individual name is to every other name in the 'Name' column?

I've tried using a loop, but I'm not sure how to apply it across the entire column:

for(i in 1:nrow(df)){
  df$distance[i] <- adist(df$Name[i], df$Name[i+1])
}

Upvotes: 1

Views: 100

Answers (2)

markhogue

Reputation: 1179

I got @zephryl's solution to work with some minor edits.

df <- data.frame('Name' = c('Chris','Christopher','John','Jon','Jonathan'))

distances <- adist(df$Name)
distances <- as.data.frame(distances)
rownames(distances) <- df$Name
colnames(distances) <- df$Name

distances

Upvotes: 1

zephryl

Reputation: 17039

Ditch the for loop -- adist() can do this directly:

distances <- adist(df$Name)
rownames(distances) <- df$Name
colnames(distances) <- df$Name

distances
            Chris Christopher John Jon Jonathan
Chris           0           6    5   5        8
Christopher     6           0    9  10        9
John            5           9    0   1        4
Jon             5          10    1   0        5
Jonathan        8           9    4   5        0
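
Since the question asks for a similarity matrix rather than raw edit distances, one possible follow-up is to rescale each distance by the length of the longer string in the pair. This is only a sketch of one common normalization, not something adist() does for you; the names names_vec, max_len and similarity are made up for illustration.

```r
# Illustrative vector; in the question this would be df$Name
names_vec <- c("Chris", "Christopher", "John", "Jon", "Jonathan")

# Pairwise Levenshtein distances, labelled by name
distances <- adist(names_vec)
dimnames(distances) <- list(names_vec, names_vec)

# Length of the longer string for every pair of names
max_len <- outer(nchar(names_vec), nchar(names_vec), pmax)

# Assumed normalization: 1 means identical, values near 0 mean very different
similarity <- 1 - distances / max_len

round(similarity, 2)
```

With this scaling, short pairs such as John/Jon are no longer penalized just for being short, but other normalizations (for example dividing by the summed lengths) are equally defensible.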

Or use stringdist::stringdistmatrix(), which yields a dist object and has additional options, e.g., choice of distance metric:

library(stringdist)

stringdistmatrix(df$Name, method = "jaccard")
          1         2         3         4
2 0.4444444                              
3 0.8750000 0.8181818                    
4 1.0000000 0.9090909 0.2500000          
5 0.9000000 0.7500000 0.3333333 0.5000000
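
The dist output above is indexed by position, which gets hard to read. A minimal sketch of one way to get a labelled, full square matrix, assuming the stringdist package is installed: pass useNames = "strings" and convert with as.matrix(). The variable names here are illustrative only.

```r
library(stringdist)

# Illustrative vector; in the question this would be df$Name
names_vec <- c("Chris", "Christopher", "John", "Jon", "Jonathan")

# useNames = "strings" labels the dist object with the input strings
jaccard_dist <- stringdistmatrix(names_vec, method = "jaccard", useNames = "strings")

# Expand the lower-triangle dist object into a full symmetric matrix
jaccard_mat <- as.matrix(jaccard_dist)

round(jaccard_mat, 3)
```

The result can then be indexed by name, e.g. jaccard_mat["John", "Jon"].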

Upvotes: 0
