static_rtti
static_rtti

Reputation: 56342

Algorithm for fuzzy pairing of strings from two lists

Suppose you have two lists of strings containing similar items, with changes (eg. List 1: Apples,fruits_b,orange; List2: Fruit,apples,banana,orange_juice).

Given a distance metric such as the Levenshtein distance, what are good algorithms for finding the optimal pairing, that is the pairing that minimizes the sum of distances for all pairings?

The result corresponding to my example would be:

Apples    - apples
fruits_b  - Fruit
orange    - orange_juice
          - banana

Subsidiary question: is there some tool that already implements this or something similar?

Upvotes: 4

Views: 1335

Answers (1)

static_rtti
static_rtti

Reputation: 56342

OK, here's my python solution using the Levenshtein distance and the Hungarian algorithm (both provided by external packages):

from munkres import Munkres
from Levenshtein import distance
from sys import argv

if __name__ == '__main__':
    if len(argv) < 3:
        print("Usage: fuzzy_match.py file file")
        print("Finds the best pairing of lines from the two input files")
        print("using the Levenshtein distance and the Hungarian algorithm")
    w1 = [l.strip() for l in open(argv[1]).readlines()]
    w2 = [l.strip() for l in open(argv[2]).readlines()]
    if len(w1) != len(w2):
        if len(w2) > len(w1):
            w1, w2 = w2, w1
        w2.extend([""]*(len(w1)-len(w2)))
    matrix = []
    for i in w1: 
        row = []
        for j in w2:
            row.append(distance(i.lower(), j.lower()))
        matrix.append(row)
    m = Munkres()
    max_length = max(len(w) for w in w1)
    for i, j in m.compute(matrix):
        print(("{:<%d}{}" % (max_length+10)).format(w1[i], w2[j]))

It works pretty nicely. I'm still curious if anyone can come up with a better algorithm, though!

Upvotes: 5

Related Questions