Reputation: 143
I'm using the python-levenshtein module to analyse Irish-language text spanning a long period of time. Over that period there were a number of orthographic changes, e.g. bí -> ḃí -> bhí: the diacritic over the 'b' and the 'h' following the 'b' both represent the same grammatical form, lenition (which is not shown at all in the first period).
Between all these forms I would want a fairly low distance, but python-levenshtein as it stands gives the same score in both cases: Levenshtein.ratio(u'ḃí', u'bí') = 0.5 and Levenshtein.ratio(u'xí', u'bí') = 0.5. Obviously a minor orthographic change to the character 'b' and its outright substitution with 'x' (a foreign borrowing to boot) shouldn't have the same score.
So is there a way to modify the cost of specific character changes, e.g. reduce the distance between bí and ḃí but increase the distance between bí and xí? Or will I need to write my own implementation?
Upvotes: 1
Views: 70
Reputation: 251
The Levenshtein algorithm ("edit distance") doesn't allow different distances between characters, but there's a generalization - the Needleman-Wunsch algorithm - that does. I'm not aware of a Python implementation, but I would recommend looking for one before implementing your own; writing one yourself is possible but non-trivial.
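If you do end up rolling your own, a minimal sketch of the idea might look like the following. It is plain Python, not part of python-levenshtein, and it is a distance-minimizing dynamic program in the spirit of the weighted generalization rather than a full Needleman-Wunsch alignment. The names `weighted_distance`, `sub_cost` and `CHEAP_PAIRS`, and the cost values 0.2 and 1.0, are assumptions made up for illustration; the dotted 'ḃ' is assumed to be the single precomposed code point U+1E03.

```python
def weighted_distance(a, b, sub_cost, indel_cost=1.0):
    """Edit distance where each substitution is priced by sub_cost(x, y)."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute ca with cb
            ))
        prev = curr
    return prev[-1]


# Hypothetical cost table: pairs that are "nearly the same letter".
# "\u1e03" is the precomposed dotted b; 0.2 is an arbitrary illustrative value.
CHEAP_PAIRS = {frozenset(("b", "\u1e03")): 0.2}


def sub_cost(x, y):
    """Return 0 for identical characters, a reduced cost for listed pairs, else 1."""
    if x == y:
        return 0.0
    return CHEAP_PAIRS.get(frozenset((x, y)), 1.0)


if __name__ == "__main__":
    print(weighted_distance("\u1e03\u00ed", "b\u00ed", sub_cost))  # dotted b vs plain b
    print(weighted_distance("x\u00ed", "b\u00ed", sub_cost))       # x vs b
```

With this table the dotted form comes out much closer to bí than xí does. Note that in this sketch bhí would still pay a full insertion for the 'h'; handling that would need something extra, such as a per-character insertion cost, which is not shown here.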
Upvotes: 1