T1m92

Reputation: 3

Scan texts for keyword list, tolerate spelling mistakes, determine a matching score for texts

I have a problem that I cannot solve on my own. My task is as follows: I have various texts and an array of strings. The array can contain single words or combinations of words, like this: ["apple", "orange fruit", "strawberry field", "ananas", "tomato plant"].

Now I need to scan my texts for the elements in the array and determine a score. If a text contains many of the strings (or combinations of them), it should get a higher score than other texts. The matching should also tolerate spelling mistakes if possible.

Can someone give me a hint about the best way to solve this? Are there any libraries that could help? The language I am coding in is Java.

Thank you in advance guys.

Upvotes: 0

Views: 270

Answers (1)

taz_13

Reputation: 7

An alternative to the Soundex algorithm mentioned by Gilbert Le Blanc is to use LevenshteinDistance from the Apache Commons Text library. It counts the number of single-character insertions, deletions, and substitutions needed to change one character sequence into another, and is very simple to use.

To accept words that need two or fewer character changes to become identical, you would do something like this (when a threshold is passed to the constructor, apply returns -1 if the distance exceeds it):

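import org.apache.commons.text.similarity.LevenshteinDistance;

// With a threshold of 2, apply() returns -1 when the distance is greater than 2
LevenshteinDistance ld = new LevenshteinDistance(2);
if (ld.apply(string1, string2) >= 0) {
    // The strings differ by at most two edits: do something, e.g. add to a map
}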
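
To turn this into a score per text, one possible approach is to slide a window of as many words as the keyword has over the text and count each keyword that matches within the allowed number of edits. The following is only a rough sketch under a few assumptions: the class name KeywordScorer and the methods distance and score are made up for illustration, the distance is a hand-written Levenshtein implementation so the example has no dependencies (you could swap in the Commons Text class instead), and each keyword counts at most once per text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library.
public class KeywordScorer {

    // Plain dynamic-programming Levenshtein distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Counts how many keywords appear in the text within maxEdits edits.
    // Assumption: a keyword contributes at most 1 to the score.
    static int score(String text, List<String> keywords, int maxEdits) {
        String[] words = text.toLowerCase().split("\\W+");
        int score = 0;
        for (String keyword : keywords) {
            String normalized = keyword.trim().toLowerCase();
            int n = normalized.split("\\s+").length;
            // Compare the keyword against every run of n consecutive words.
            for (int i = 0; i + n <= words.length; i++) {
                String window = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                if (distance(window, normalized) <= maxEdits) {
                    score++;
                    break;
                }
            }
        }
        return score;
    }
}
```

For example, scoring the text "I bought an aple and walked through a strawbery field." against the keyword list from the question with maxEdits set to 1 gives 2, because "aple" matches "apple" and "strawbery field" matches "strawberry field". Keep in mind that a fixed edit threshold is generous for short keywords, so you may want to scale it with the keyword length.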

Upvotes: 0
