jvh_ch

Reputation: 347

How to normalise Levenshtein distance for maximum alignment length rather than for string length?

Problem: A few R packages feature Levenshtein distance implementations for computing the similarity of two strings, e.g. http://finzi.psych.upenn.edu/R/library/RecordLinkage/html/strcmp.html. The distances computed can easily be normalised for string length, e.g. by dividing the Levenshtein distance by the length of the longest string involved or by dividing it by the mean of the lengths of the two strings. For some applications in linguistics (e.g. dialectometry and receptive multilingualism research), however, it is recommended that the raw Levenshtein distance be normalised for the length of the longest least-cost alignment (Heeringa, 2004: 130-132). This tends to produce distance measures that make more sense from a perceptual-linguistic point of view.
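For comparison, the conventional string-length normalisation is straightforward in base R; here is a sketch using `adist` (the name `normByLength` is just illustrative, not from any package):

```r
## Conventional normalisation: raw Levenshtein distance divided by
## the length of the longer of the two strings.
normByLength <- function(a, b) {
  drop(adist(a, b)) / pmax(nchar(a), nchar(b))
}
normByLength("tsYklUs", "sYkEl")  # 4/7, i.e. ~0.571
```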

Example: The German string "tsYklUs" (Zyklus = cycle) can be converted into its Swedish cognate "sYkEl" (cyckel = (bi)cycle) in a 7-slot alignment with two insertions (I) and two substitutions (S) for a total transformation cost of 4. Normalised Levenshtein distance: 4/7

(A)

t--s--Y--k--l--U--s
---s--Y--k--E--l---
===================
I-----------S--S--I = 4

It is also possible to convert the strings in an 8-slot alignment with 3 insertions (I) and 1 deletion (D), also for a total alignment cost of 4. Normalised Levenshtein distance: 4/8

(B)

t--s--Y--k-----l--U--s
---s--Y--k--E--l------
======================
I-----------D-----I--I = 4

The latter alignment makes more sense linguistically, because it aligns the [l]-phonemes with each other rather than with the [E] and [U] vowels.

Question: Does anyone know of any R function that would allow me to normalise Levenshtein distances for the longest least-cost alignment rather than for string length proper? Thanks for your input!

Reference: W.J. Heeringa (2004), Measuring dialect pronunciation differences using Levenshtein distance. PhD thesis, University of Groningen. http://www.let.rug.nl/~heeringa/dialectology/thesis/

Edit - Solution: I think I figured out a solution. The adist function can return the alignment and seems to default to the longest least-cost alignment. To take up the example above, here's the alignment for converting "sykel" into "tsyklus":

> attr(adist("sykel", "tsyklus", counts = TRUE), "trafos")
     [,1]      
[1,] "IMMMDMII"
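Each character of the trafos string is one alignment slot (M = match, I = insertion, D = deletion, S = substitution), so its length is the alignment length and the number of non-M slots is the raw distance. A quick sanity check, assuming the same alignment as above:

```r
tr <- attr(adist("sykel", "tsyklus", counts = TRUE), "trafos")[1, 1]
nchar(tr)                           # alignment length: 8
sum(strsplit(tr, "")[[1]] != "M")   # raw edit cost: 4
```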

To compute length-normalised distances as recommended by Heeringa (2004), we can write a modest function:

normLev.fnc <- function(a, b) {
  ## Raw Levenshtein distance divided by the length of the
  ## least-cost alignment (the "trafos" string from adist).
  drop(adist(a, b) / nchar(attr(adist(a, b, counts = TRUE), "trafos")))
}

For the example above, this returns

> normLev.fnc("sykel", "tsyklus")
[1] 0.5

This function also returns the correct normalised distances for Heeringa's (2004: 131) examples:

> normLev.fnc("bine", "bEi")
[1] 0.6
> normLev.fnc("kaninçen", "konEin")
[1] 0.5555556
> normLev.fnc("kenEeri", "kenArje")
[1] 0.5

To compare several pairs of strings:

> L1 <- c("bine", "kaninçen", "kenEeri")
> L2 <- c("bEi",  "konEin", "kenArje")
> diag(normLev.fnc(L1, L2))
[1] 0.6000000 0.5555556 0.5000000
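A note on that last call: adist(L1, L2) computes the full length(L1) × length(L2) distance matrix, and diag then discards everything off the diagonal. For long vectors of string pairs, a per-pair variant avoids the wasted work; a sketch (normLev.pairs is an illustrative name):

```r
normLev.pairs <- function(a, b) {
  ## Normalised distance for element-wise pairs only, skipping the
  ## cross-product matrix that diag() would throw away.
  mapply(function(x, y) {
    d <- adist(x, y, counts = TRUE)
    as.numeric(d) / nchar(attr(d, "trafos")[1, 1])
  }, a, b, USE.NAMES = FALSE)
}
normLev.pairs(c("bine", "kaninçen", "kenEeri"), c("bEi", "konEin", "kenArje"))
```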

Upvotes: 15

Views: 5890

Answers (1)

jvh_ch

Reputation: 347

In case any linguists stumble upon this post, I'd like to point out that the algorithms provided by the RecordLinkage package appear to operate byte-wise rather than character-wise, which distorts distances for non-ASCII strings, e.g.:

> levenshteinSim("väg", "way")
[1] -0.3333333
> levenshteinDist("väg", "way")
[1] 4
> levenshteinDist("väg", "wäy")
[1] 2
> levenshteinDist("väg", "wüy")
[1] 3
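By contrast, in a UTF-8 locale base R's adist counts multibyte characters as single symbols, which sidesteps the problem (a quick check):

```r
## adist() operates on characters, not bytes, so "ä" is one symbol:
adist("väg", "way")  # 3: three substitutions
adist("väg", "wäy")  # 2: "ä" matches itself
```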

Upvotes: 4
