Herka

Reputation: 707

Algorithm that finds similar words based on their letters

I am looking for a way to detect the similarity of words (text strings) based on their shared letters.

I was looking into hash functions, and especially the Rabin-Karp algorithm, to find similar words inside larger strings.

But it doesn't work for my case. Here are three examples of words I consider "similar", based on German banks: there is the "Deutsche Bank", the "Postbank", and the "Landesbank". All three have the word "bank" inside their name, but only the Deutsche Bank has it as an individual word. So I basically want to measure the similarity of words based on their shared characters. I think there should be a limit so that only shared sequences of >= 4 characters are considered, if that's possible.

And if I were just looking for the word "bank", I would hardcode something. But I'm looking for a way to find such similar names/strings without knowing about them in the first place.

Upvotes: 0

Views: 2418

Answers (4)

Ben Horsburgh

Reputation: 563

Sounds like a good use case for Levenshtein distance. In brief, this calculates the number of "edits" you need to make to go from one string to another. You can improve the model by adding the probabilities of particular edits occurring, but that may not be necessary. In my experience, a 2-edit limit works well for basic spell correction.

Wiki Page: Levenshtein Distance
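
As a minimal sketch, here is the standard dynamic-programming computation of Levenshtein distance in Python (the function name and example values are illustrative, not from the answer):

def levenshtein(s, t):
    # Minimum number of single-character insertions, deletions,
    # and substitutions needed to turn s into t.
    m, n = len(s), len(t)
    prev = list(range(n + 1))   # row for the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))             # 3
print(levenshtein("Postbank", "Landesbank") <= 2)   # False: more than 2 edits apart

Note that the second check also illustrates why a small edit limit suits spell correction better than the asker's bank example: the bank names are many edits apart despite sharing "bank".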

Upvotes: 1

Thomas Jungblut

Reputation: 20969

Grab an embedding model (there are pre-trained ones built with word2vec or GloVe) that gives you a dense vector representation of the semantic meaning of a word.

Then you can easily measure semantic similarity as the cosine distance between the embedding vectors of two words. The big improvement here over the usual information-retrieval techniques proposed (like Levenshtein distance or common substrings / word overlap) is that it is a similarity based on the abstracted context of these words. So Postbank would be very close to Barclays or HSBC by being a banking corporation, instead of just sharing substrings.
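
As a sketch of the cosine-similarity step, with made-up 3-dimensional vectors standing in for real word2vec/GloVe embeddings (real ones come from a trained model and have hundreds of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, purely illustrative -- not actual embeddings.
postbank = np.array([0.9, 0.1, 0.2])
hsbc = np.array([0.8, 0.2, 0.1])
print(cosine_similarity(postbank, hsbc))   # close to 1.0 for similar words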

Furthermore, once you have vectors for the words through their embeddings, you can use any clustering algorithm to find clusters of words that share semantic similarity, even without knowing in advance that a specific taxonomy describing such a cluster exists.
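
A clustering sketch under the same assumption (one embedding vector per word already available); scikit-learn's KMeans is one such clustering algorithm, and the random vectors here are placeholders for real embeddings:

import numpy as np
from sklearn.cluster import KMeans

words = ["Postbank", "Landesbank", "HSBC", "Volkswagen", "BMW"]
vectors = np.random.rand(len(words), 50)   # placeholder for trained embeddings

labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)   # with real embeddings, the banks would share a cluster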

Upvotes: 1

Bernhard Barker

Reputation: 55619

To compare any two given strings, we can come up with a number signifying how good a match is, based on the length of each shared substring above our threshold.

Comparing the given string with all other strings and finding those with the highest match numbers would give us the most "similar" strings.

My approach assumes the substring matches are in order. It won't work if the order of the substrings doesn't matter, e.g. if abcdXXefgh should be equally similar to both abcdYYefgh and efghYYabcd. It might be possible to adapt it, though, by inspecting the values of c (defined below) at the end and calculating the match number from there.

It's possible to tweak it a bit if you want to take into account the lengths of the strings or other factors.


We can use an approach similar to the dynamic programming approach used for Levenshtein distance.

The distances of all possible prefixes can be stored in an array d[][], where d[i][j] is the distance between the first i characters of string s and the first j characters of string t.

We can have an additional array c which keeps track of the length of the current substring match.

Then any given cell value [i,j] would be the maximum of:

  • The cost before the current substring match (d[i-length,j-length]) plus the cost of the current match (0 if length < threshold, with length = c[i,j])
  • As defined in Levenshtein distance: (values are copied directly, these operations have no cost)
    • Insertion (d[i, j-1])
    • Deletion (d[i-1, j])
    • Substitution (same or different character) (d[i-1, j-1])

Pseudo-code for this:

set each cell in c and d to 0

threshold = 4     // shortest possible substring to consider

for j = 1 to n
    for i = 1 to m
        currentSubstringCost = 0
        if s[i] == t[j]     // the matching substring grows by one character
            c[i, j] = c[i-1, j-1] + 1
            if c[i, j] >= threshold
                currentSubstringCost = d[i - c[i, j], j - c[i, j]]
                                       + calculateSubstringCost(c[i, j])
        d[i, j] = maximum(currentSubstringCost,      // substring match
                          d[i-1, j],                 // deletion
                          d[i, j-1],                 // insertion
                          d[i-1, j-1])               // substitution

return d[m, n]

You'll need to figure out how you want to calculate the cost of any given substring (calculateSubstringCost).

A simple approach would be to take the square of the substring length, so a substring of length 4 would cost 16, while a substring of length 6 would cost 36.

You can also optimise this to require only 2 rows of storage, by keeping another array that tracks the cost before the current substring match, so that we only ever need to look back at most 1 row (the Wikipedia article on Levenshtein distance shows this style of two-row optimisation).
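
Here is a direct Python translation of the pseudo-code above, using the squared-length cost suggested earlier (a sketch, not a tested reference implementation):

def substring_similarity(s, t, threshold=4):
    # d[i][j]: best score for the first i chars of s vs. the first j chars of t.
    # c[i][j]: length of the substring match ending at s[i-1] and t[j-1].
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    c = [[0] * (n + 1) for _ in range(m + 1)]

    def calculate_substring_cost(length):
        return length * length   # length 4 -> 16, length 6 -> 36

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            current_substring_cost = 0
            if s[i - 1] == t[j - 1]:          # a matching substring continues
                c[i][j] = c[i - 1][j - 1] + 1
                if c[i][j] >= threshold:
                    k = c[i][j]
                    current_substring_cost = (d[i - k][j - k]
                                              + calculate_substring_cost(k))
            d[i][j] = max(current_substring_cost,   # substring match
                          d[i - 1][j],              # deletion
                          d[i][j - 1],              # insertion
                          d[i - 1][j - 1])          # substitution
    return d[m][n]

# Comparison is case-sensitive, so the inputs are lowercased here.
print(substring_similarity("deutsche bank", "postbank"))   # 16, from the shared "bank"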

Upvotes: 1

Anjan Dash

Reputation: 194

Correct me if I am wrong. From your question, I understand that you need to find all strings which have some part in common.

We can find the common substrings between all the strings and give each a score based on the length of the substring. Based on a threshold, you can then decide whether two strings belong to the same group or not.
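
One way to sketch this idea in Python is with the standard library's difflib, scoring only matching blocks of at least 4 characters (the scoring rule and threshold here are illustrative):

from difflib import SequenceMatcher

def shared_substring_score(a, b, min_len=4):
    # Sum the lengths of common in-order substrings of at least min_len chars.
    blocks = SequenceMatcher(None, a.lower(), b.lower()).get_matching_blocks()
    return sum(block.size for block in blocks if block.size >= min_len)

names = ["Deutsche Bank", "Postbank", "Landesbank", "Volkswagen"]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = shared_substring_score(names[i], names[j])
        if score > 0:   # same group if there is any long-enough overlap
            print(names[i], "~", names[j], "score:", score)

All three bank names pair up via the shared "bank", while Volkswagen matches none of them.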

Upvotes: 5
