Reputation: 707
I am looking for a way to detect the similarity of words (text strings) based on their shared letters.
I was looking into hash functions, and especially the Rabin-Karp algorithm, to find similar words inside larger strings.
But it doesn't work for the case I have. Three examples of words I consider "similar" in my case, based on German banks: there is the "Deutsche Bank", "Postbank", and "Landesbank". All three have the word "bank" inside their name, but only the Deutsche Bank has it as an individual word. So basically I want to measure the similarity of words based on their shared characters. I think there should be a limit so that only shared runs of >= 4 characters are considered, if that's possible.
If I were just looking for the word "bank" I would hardcode something. But I'm looking for a way to find such similar names/strings without knowing about them in the first place.
Upvotes: 0
Views: 2418
Reputation: 563
Sounds like a good use case for Levenshtein distance. In brief, this calculates the number of "edits" you need to make to go from one string to another. You can improve on this by adding probabilities of edits occurring into the model too, but this may not be necessary. I've usually found a 2-edit limit to be good for basic spell correction.
Wiki Page: Levenshtein Distance
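For illustration, here is a minimal pure-Python sketch of the classic two-row dynamic-programming version (the function name levenshtein is my own):

```python
def levenshtein(a, b):
    # Edit distance: minimum number of single-character insertions,
    # deletions, and substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,         # deletion
                           cur[j - 1] + 1,      # insertion
                           prev[j - 1] + cost)) # substitution (or match)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

With a 2-edit limit you would then accept a candidate only if levenshtein(a, b) <= 2.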
Upvotes: 1
Reputation: 20969
Grab an embedding model (there are pretrained ones such as word2vec or GloVe) that gives you a dense vector representation of the semantic meaning of a word.
Then you can easily find semantic similarity by the cosine distance between words via their embedding vectors. The big improvement here over the usual information retrieval techniques proposed (like Levenshtein or common substrings / word overlap) is that it is a similarity based on the abstracted context of these words. So Postbank would be very close to Barclays or HSBC by being a banking corp, instead of just sharing substrings.
Conversely, once you have vectors for the words through their embeddings, you can use any clustering algorithm to find clusters of words that share semantic similarity, even without knowing that a specific taxonomy describing such a cluster actually exists.
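As a sketch of the cosine step: the tiny 3-dimensional vectors below are made up for illustration (real word2vec/GloVe embeddings have hundreds of dimensions and come from a trained model):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not real model output):
postbank = [0.9, 0.1, 0.2]
barclays = [0.8, 0.2, 0.1]
zebra    = [0.1, 0.9, 0.8]

# The two "bank" vectors point in a similar direction,
# so their cosine similarity is much higher:
print(cosine_similarity(postbank, barclays))
print(cosine_similarity(postbank, zebra))
```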
Upvotes: 1
Reputation: 55619
To compare any 2 given strings, we can come up with a number signifying how good a match is based on the length of each substring above our threshold.
Comparing the given string with all other strings and finding those with the highest match numbers would give us the most "similar" strings.
My approach assumes the substring matches are in order. This won't work if the order of the substrings doesn't matter, e.g. if abcdXXefgh should be equally similar to abcdYYefgh and efghYYabcd. Although it might be possible to modify it by inspecting the values of c at the end and calculating the match number from there.
It's possible to tweak it a bit if you want to take into account the lengths of the strings or other factors.
We can use an approach similar to the dynamic programming approach used for Levenshtein distance.
The distance of all possible prefixes might be stored in an array d[][], where d[i][j] is the distance between the first i characters of string s and the first j characters of string t.
We can have an additional array c which keeps track of the length of the current substring match.
Then any given cell value d[i, j] would be the maximum of:

- d[i-length, j-length] plus the cost of the current match (0 if length < threshold, with length = c[i, j]) (substring match)
- d[i-1, j] (deletion)
- d[i, j-1] (insertion)
- d[i-1, j-1] (substitution)

Pseudo-code for this:
set each cell in c and d to 0
threshold = 4 // shortest possible substring to consider
for j = 1 to n
    for i = 1 to m
        currentSubstringCost = 0
        if s[i] == t[j] // there's a matching substring
            c[i, j] = c[i-1, j-1] + 1
            if c[i, j] >= threshold
                currentSubstringCost = d[i - c[i, j], j - c[i, j]]
                                       + calculateSubstringCost(c[i, j])
        d[i, j] = maximum(currentSubstringCost, // substring match
                          d[i-1, j],            // deletion
                          d[i, j-1],            // insertion
                          d[i-1, j-1])          // substitution
return d[m, n]
You'll need to figure out how you want to calculate the cost of any given substring (calculateSubstringCost). A simple approach would be to take the square of the substring length, so a substring of length 4 would cost 16, while a substring of length 6 would cost 36.
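As a sketch, here is the pseudo-code above translated into Python, using the squared-length cost (the function name substring_similarity is my own):

```python
def substring_similarity(s, t, threshold=4):
    # Score two strings by their in-order shared substrings of
    # length >= threshold; longer shared runs score quadratically more.
    m, n = len(s), len(t)
    # d[i][j]: best score using the first i chars of s and first j of t
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # c[i][j]: length of the substring match ending at s[i-1], t[j-1]
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            current = 0
            if s[i - 1] == t[j - 1]:  # there's a matching substring
                c[i][j] = c[i - 1][j - 1] + 1
                length = c[i][j]
                if length >= threshold:
                    # score before this match, plus squared-length cost
                    current = d[i - length][j - length] + length ** 2
            d[i][j] = max(current,           # substring match
                          d[i - 1][j],       # deletion
                          d[i][j - 1],       # insertion
                          d[i - 1][j - 1])   # substitution
    return d[m][n]

print(substring_similarity("deutsche bank", "postbank"))  # 16, for "bank"
```

Note the comparison is case-sensitive as written, so you'd likely want to lowercase both strings first.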
You can also optimise this to only require 2 rows by having another array which keeps track of the cost before the current substring match, which results in us only needing to look backwards at most 1 row (see above link for example).
Upvotes: 1
Reputation: 194
Correct me if I am wrong. From your question, I understand that you need to find all strings which have some part in common.
We can find the common substrings between all the strings. Based on the length of the substring we can assign a score, and based on a threshold you can decide whether the strings belong to the same group or not.
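As a sketch of this idea, Python's standard-library difflib can find the longest common substring between two strings, which could serve as the basis for such a score:

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # Find the longest contiguous run of characters shared by a and b.
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def score(a, b, threshold=4):
    # Score by length of the longest shared substring;
    # runs shorter than the threshold count as no match.
    size = len(longest_common_substring(a, b))
    return size if size >= threshold else 0

print(longest_common_substring("postbank", "landesbank"))  # "bank"
```

Comparing every pair of strings this way and grouping those whose score clears the threshold would give the groups described above.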
Upvotes: 5