Reputation: 596
I'm looking for an algorithm, similar to the longest common subsequence (LCS) algorithms, that uses a similarity metric between letters of the alphabet. What I mean is that the known algorithms treat all letters of the alphabet as completely different, whereas in my use case some letters are easier to edit into other letters, so the diffing algorithm should treat them as similar.
As a usage example, think of a diffing algorithm working on lines of text, where some lines are more similar to certain other lines.
The paper An O(ND) Difference Algorithm and Its Variations states on page 4: "Consider adding a weight or cost to every edge. Give diagonal edges weight 0 and non-diagonal edges weight 1." I'd like the option to assign any weight from the interval [0, 1].
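To make the idea concrete, here is a minimal sketch of such a per-line weight in [0, 1], using Python's difflib as an illustrative similarity metric (the function name and the choice of difflib are my own, not from the question): identical lines cost 0, like a diagonal edge, and completely different lines cost 1, like a non-diagonal edge.

```python
import difflib

def line_weight(a, b):
    """Edit weight in [0, 1] between two lines of text.

    0.0 means the lines are identical (zero edit cost);
    1.0 means the lines share nothing (full edit cost).
    """
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()
```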
Upvotes: 3
Views: 282
Reputation: 114
The Longest Common Subsequence (LCS) problem is usually solved by dynamic programming, and you can tweak existing methods to apply them to your use case.
In this example from Wikipedia explaining how LCS works, https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Example, you should tweak the algorithm so that:
instead of scoring score_j = score_i + 1 for j = i + 1 (that is to say, adding 1 each time a new common item is added to the LCS), you score score_j = F(score_i, p(letter_i, letter_j)) for every pair of compared letters.
Here p(letter_i, letter_j) is the probability of changing letter_i into letter_j (that is the weight in [0, 1] you were talking about), and F is an aggregation function that takes you from score_i to score_j given that probability p. For instance, F can be defined as the + operator, which would yield:

score_j = score_i + p(letter_i, letter_j)

or, more precisely:

score_j = score_i + p(letter_i, letter_j) x 1

(read the "x 1" as "times one character").
That would give you the maximum similarity (in characters) of the two subsequences, which you can recover by backtracking at the end of the algorithm.
You can define your own function F to yield better results.
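The tweaked dynamic program above can be sketched as follows. This is a hypothetical implementation, with F taken to be the + operator and a toy p that scores identical letters 1.0, same letters in different case 0.5, and anything else 0.0; swap in your own p for your alphabet.

```python
def p(a, b):
    """Toy similarity weight in [0, 1] between two letters.

    Identical letters score 1.0; letters that are "easy to edit into
    each other" (here: the same letter up to case) score 0.5;
    everything else scores 0.0 (no match).
    """
    if a == b:
        return 1.0
    if a.lower() == b.lower():
        return 0.5
    return 0.0

def weighted_lcs(x, y):
    """Return the maximum total similarity and one matched pair list."""
    n, m = len(x), len(y)
    # score[i][j] = best total similarity between x[:i] and y[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # F = "+": matching x[i-1] with y[j-1] adds p(...) instead
            # of the flat 1 used by the classic LCS recurrence.
            match = score[i - 1][j - 1] + p(x[i - 1], y[j - 1])
            score[i][j] = max(match, score[i - 1][j], score[i][j - 1])
    # Backtrack to recover which letters were paired up.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        w = p(x[i - 1], y[j - 1])
        if w > 0 and score[i][j] == score[i - 1][j - 1] + w:
            pairs.append((x[i - 1], y[j - 1]))
            i -= 1
            j -= 1
        elif score[i - 1][j] >= score[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return score[n][m], pairs[::-1]
```

For example, weighted_lcs("AbC", "abc") pairs all three positions for a total similarity of 2.0 (0.5 + 1.0 + 0.5), whereas the classic unweighted LCS of those strings would only be "b".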
Upvotes: 1