Reputation: 18387
I need to build a search over people's names. I have already seen the great content on this site, but I need something different.
Here's my requirement.
I've tried a phonetic search, but the names I need to index are not English names. I believe the phonetic algorithms implemented by Apache Solr / Lucene are not valid for Portuguese words (my language).
After that, I decided to search using n-grams. It seems to work, but I need some way to measure how closely what the user typed resembles what the Solr index contains. I cannot use the score, because it depends on how many times a word occurs across all documents. So I need a number (a percentage, for example) as the result of the comparison: in other words, how close what the user typed is to the real name I have in Solr.
PS: I will use this result in my application to decide whether to keep what the user typed or to go with what already exists in Solr.
Sample:
ID NAME
1 James Bond
2 James Bond Junior
3 Tony Mellord
The user could search for Jhames Bond and, using n-grams, both 1 and 2 will match.
PS: I used English names just to clarify the scenario.
Is there any way to answer the question "how closely does what the user typed resemble what I have indexed?" without using the score? Let's say:
Jhames Bond matches James Bond at 97% (for example)
Jhames Bond matches James Bond Junior at 87%
Upvotes: 1
Views: 823
Reputation: 33341
If you are happy with how you are querying and just want to come up with the percentage, you could compare the query value with the value returned from the index, as a postprocessing step, using a Levenshtein distance.
There is an implementation of the Levenshtein distance algorithm in Apache Commons Lang: StringUtils.getLevenshteinDistance
The maximum possible distance is the length of the longer of the two strings compared, so getting a percentage might look something like this (note the cast to double; with plain int division the result would always be 0 or 1):
1.0 - ((double) StringUtils.getLevenshteinDistance(str1, str2) / Math.max(str1.length(), str2.length()));
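To make the normalization concrete, here is a minimal self-contained sketch that does not depend on Apache Commons. The class name NameSimilarity and its method names are hypothetical, and the distance routine is a plain hand-written dynamic-programming Levenshtein, not the Commons implementation, so treat it as an illustration rather than a drop-in replacement.

```java
// Hypothetical helper class; names are illustrative, not part of any library.
final class NameSimilarity {

    // Levenshtein distance using two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1 minus the distance divided by the longer length.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }
}
```

With the sample names from the question, NameSimilarity.similarity("Jhames Bond", "James Bond") is 1 - 1/11, about 0.91, and NameSimilarity.similarity("Jhames Bond", "James Bond Junior") is 1 - 8/17, about 0.53. Multiply by 100 to present the value as a percentage.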
Jaro-Winkler Distance (StringUtils.getJaroWinklerDistance) might also be a better algorithm to use, and a bit simpler since it is already normalized such that it could be presented as a percentage. It also seems to come out closer to the example values you have provided.
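For readers who want to see what the Jaro-Winkler score is made of, here is a rough self-contained sketch. The class name JaroWinklerSketch is hypothetical, and this is a textbook-style implementation written for illustration (scaling factor 0.1, common prefix capped at 4 characters); it is not the Commons code, and the library may round or differ slightly depending on its version.

```java
// Hypothetical illustration class; not the Apache Commons implementation.
final class JaroWinklerSketch {

    // Jaro similarity in [0, 1].
    static double jaro(String s1, String s2) {
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 && len2 == 0) return 1.0;
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters match only if they are equal and within this window.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2; // transpositions (integer division)
        return (m / len1 + m / len2 + (m - t) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 chars.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

With this sketch, JaroWinklerSketch.jaroWinkler("Jhames Bond", "James Bond") comes out at roughly 0.97 and JaroWinklerSketch.jaroWinkler("Jhames Bond", "James Bond Junior") at roughly 0.85, which is in the same range as the 97% and 87% given in the question.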
Upvotes: 2