I have two character vectors in R: a reference list and a raw data list.
I want to compare the reference list to the raw character list using jarowinkler and assign a % similarity score. For example, if I have 10 reference items and 20 raw data items, I want to get the best score for each comparison and what the algorithm matched it to (so 2 vectors of 10). If I have raw data of size 8 and 10 reference items, I should end up with a result of only 2 vectors of 8 items each, holding the best match and its score per item:
item, match, matched_to
ice, 78, ice-cream
Below is my code, which isn't much to look at.
NumItems.Raw = length(words)
NumItems.Ref = length(Ref.Desc)
for (item in words)
{
  for (refitem in Ref.Desc)
  {
    jarowinkler(refitem, item)
    # Find Best match Score
    # Find Best Item in reference table
    # Add both items to vectors
    # decrement NumItems.Raw
    # Loop
  }
}
Upvotes: 8
Views: 9737
Using a toy example:
library(RecordLinkage)
library(dplyr)
ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala','bear','fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow','cat','horse')
wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)
wordlist %>%
  group_by(words) %>%
  mutate(match_score = jarowinkler(words, ref)) %>%
  summarise(match = match_score[which.max(match_score)],
            matched_to = ref[which.max(match_score)])
gives
  words match     matched_to
1 cat   1.0000000 cat
2 cow   1.0000000 cow
3 dog   1.0000000 dog
4 emu   0.5277778 bear
5 horse 1.0000000 horse
6 kiwi  0.5350000 koala
7 pig   1.0000000 pig
8 sheep 1.0000000 sheep
Edit: In response to the OP's comment, the last command uses the pipeline approach from dplyr. It groups every combination of raw word and reference by the raw word, adds a column match_score holding the jarowinkler score, and returns only a summary per word: the highest match score (indexed by which.max(match_score)) and the reference found at that same index.
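For anyone who would rather not pull in dplyr, the same grouping-and-maximum logic can be sketched in base R. This is only a sketch: it assumes jarowinkler is vectorised over equal-length character vectors (which the mutate call above already relies on), and the names scores, best and result are made up for illustration. The score is scaled to a rounded percentage to match the output format asked for in the question.

```r
library(RecordLinkage)

ref   <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')

# scores[i, j] is the similarity between words[i] and ref[j].
# outer() expands both vectors to the same length and calls jarowinkler once.
scores <- outer(words, ref, jarowinkler)

# Column index of the best reference for each raw word (first one on ties).
best <- max.col(scores, ties.method = "first")

# One row per raw word: the word, its best score as a percentage, and the match.
result <- data.frame(
  item       = words,
  match      = round(100 * scores[cbind(seq_along(words), best)]),
  matched_to = ref[best],
  stringsAsFactors = FALSE
)
print(result)
```

Unlike the summarise output, result keeps the raw words in their original order, and its length always follows the raw vector rather than the reference vector.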
Upvotes: 14
There is a package which already implements the Jaro-Winkler distance.
> install.packages("stringdist")
> library(stringdist)
> 1-stringdist('ice','ice-cream',method='jw')
[1] 0.7777778
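One caveat worth checking against the stringdist documentation: with the default p = 0, method = 'jw' returns the plain Jaro distance, and the Winkler prefix adjustment only applies when p is set above zero (0.1 is the conventional value). The sketch below assumes that behaviour and uses stringdistmatrix to find the best reference for each raw word; the vectors are the toy data from the other answer, and dist, best and result are illustrative names.

```r
library(stringdist)

ref   <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')

# Jaro-Winkler similarity for a single pair; p = 0.1 enables the prefix bonus.
sim_ice <- 1 - stringdist('ice', 'ice-cream', method = 'jw', p = 0.1)

# dist[i, j] is the Jaro-Winkler distance between words[i] and ref[j].
dist <- stringdistmatrix(words, ref, method = 'jw', p = 0.1)

# Smallest distance per row is the best match (first one on ties).
best <- max.col(-dist, ties.method = "first")

result <- data.frame(
  item       = words,
  match      = round(100 * (1 - dist[cbind(seq_along(words), best)])),
  matched_to = ref[best],
  stringsAsFactors = FALSE
)
print(result)
```

Because stringdist returns a distance, the similarity is 1 minus that value, and the best match is the minimum rather than the maximum.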
Upvotes: 3