Reputation: 13
For example, if I had a data.frame such as:
df <- data.frame(Name = c('Chris','Christopher','John','Jon','Jonathan'))
Is there a way for me to build a similarity matrix comparing how similar each individual name is to every other name in the 'Name' column?
I've tried using a loop, but I'm not really sure how to apply this across the entire column:
for(i in 1:nrow(df)){
df$distance[i] <- adist(df$Name[i], df$Name[i+1])
}
Upvotes: 1
Views: 100
Reputation: 1179
I got @zephryl's solution to work with some minor edits:
df <- data.frame('Name' = c('Chris','Christopher','John','Jon','Jonathan'))
distances <- adist(df$Name)
distances <- as.data.frame(distances)
rownames(distances) <- df$Name
colnames(distances) <- df$Name
distances
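Since the question asks for similarity rather than distance, one option is to rescale the edit distances into the range 0 to 1. This is only a sketch under my own assumptions: dividing by the length of the longer name in each pair is one common normalization, not something adist() does for you, and names_vec, max_len and similarity are illustrative names.

```r
# Example names from the question
names_vec <- c("Chris", "Christopher", "John", "Jon", "Jonathan")

# Pairwise Levenshtein (edit) distances from base R's utils::adist()
d <- adist(names_vec)

# Length of the longer string in each pair (assumed normalizer)
max_len <- outer(nchar(names_vec), nchar(names_vec), pmax)

# Normalized similarity: 1 means identical, 0 means every position differs
similarity <- 1 - d / max_len
dimnames(similarity) <- list(names_vec, names_vec)

round(similarity, 2)
```

With this scaling, 'John' and 'Jon' (one edit, longer name has four characters) score 0.75, and every name scores 1 against itself.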
Upvotes: 1
Reputation: 17039
Ditch the for loop; adist() can do this directly:
distances <- adist(df$Name)
rownames(distances) <- df$Name
colnames(distances) <- df$Name
distances
            Chris Christopher John Jon Jonathan
Chris           0           6    5   5        8
Christopher     6           0    9  10        9
John            5           9    0   1        4
Jon             5          10    1   0        5
Jonathan        8           9    4   5        0
Or use stringdist::stringdistmatrix(), which yields a dist object and has additional options, e.g., choice of distance metric:
library(stringdist)
stringdistmatrix(df$Name, method = "jaccard")
          1         2         3         4
2 0.4444444
3 0.8750000 0.8181818
4 1.0000000 0.9090909 0.2500000
5 0.9000000 0.7500000 0.3333333 0.5000000
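The dist object above is indexed by position rather than by name. If a labelled, full square matrix is easier to work with, something like the following may help. It is a minimal sketch that assumes the stringdist package is installed and that its useNames argument accepts "strings", as in recent versions; jaccard_full is just an illustrative name.

```r
library(stringdist)

# Same data as above
df <- data.frame(Name = c("Chris", "Christopher", "John", "Jon", "Jonathan"))

# useNames = "strings" labels the dist object with the input strings
jaccard_dist <- stringdistmatrix(df$Name, method = "jaccard", useNames = "strings")

# as.matrix() expands the lower triangle into a full symmetric matrix
jaccard_full <- as.matrix(jaccard_dist)

round(jaccard_full, 3)
```

Individual pairs can then be looked up by name, e.g. jaccard_full["John", "Jon"].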
Upvotes: 0