Misconstruction

Reputation: 1909

Approximate string matching of author names - modules and strategies

I've created a small program that checks whether given authors are present in a database of authors. I haven't been able to find any module built for this specific problem, so I'm writing it from scratch using modules for approximate string matching.

The database contains around 6,000 authors and is very poorly formatted (many typos, variations, titles such as "Dr.", etc.). Each query list usually contains 500-1000 author names (and I have many of these lists), so speed is quite important.

My general strategy is to trim and filter the database as much as possible and look for exact matches first. Only if no exact match is found do I move on to approximate string matching.
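
To make the two-stage idea concrete, here is a minimal sketch of what I mean (the normalization rules and helper names are illustrative, not my exact code):

import re

TITLES = re.compile(r'\b(?:Dr|Prof|Mr|Mrs|Ms)\.?\s*', re.IGNORECASE)

def normalize(name):
    """Strip titles, collapse whitespace, lowercase."""
    return ' '.join(TITLES.sub('', name).split()).lower()

def exact_lookup(query, database):
    """Stage 1: exact match on cleaned names; None means 'fall through to fuzzy'."""
    cleaned = {normalize(name): name for name in database}
    return cleaned.get(normalize(query))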

I'm currently using the built-in difflib.get_close_matches, which does exactly what I want. However, it is extremely slow (several minutes per list), so I am looking for other options; a minimal example of the call follows.
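
Here is the shape of that call (candidate names made up for illustration; n and cutoff are difflib's standard parameters, defaulting to 3 and 0.6):

import difflib

candidates = ['Porter', 'Portet', 'Pfeil', 'Peel']
print(difflib.get_close_matches('Peil', candidates, n=3, cutoff=0.6))
# -> ['Pfeil', 'Peel'] (ranked by SequenceMatcher ratio)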

The only alternative I have found so far is fuzzywuzzy, which is even slower than difflib.

Upvotes: 7

Views: 2630

Answers (2)

ankostis

Reputation: 9473

Try fuzzywuzzy with the C-native python-Levenshtein lib installed.
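
A quick way to check which backend fuzzywuzzy will use (my own check, not part of the benchmark below): the library picks up python-Levenshtein automatically when it is importable, and otherwise warns that it is falling back to the slow pure-Python SequenceMatcher.

try:
    import Levenshtein  # provided by the python-Levenshtein package
    print("C-native Levenshtein backend available")
except ImportError:
    print("no C backend; fuzzywuzzy falls back to pure-Python difflib")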

I ran a benchmark on my PC, finding the best candidates for 8 words within a ~19k-word list, with and without the C-native Levenshtein backend installed (using pip install python_Levenshtein-0.12.0-cp34-none-win_amd64.whl), and I got these timings:

  • No C-backend:
    Compared 151664 words in 48.591717004776 sec (0.00032039058052521366 sec/search).
  • C-backend installed:
    Compared 151664 words in 13.034106969833374 sec (8.594067787895198e-05 sec/search).

That is ~4x faster (but not as much as I expected).

Here are the results:

0 of 8: Compared 'Lemaire' --> `[('L.', 90), ('Le', 90), ('A', 90), ('Re', 90), ('Em', 90)]`
1 of 8: Compared 'Peil' --> `[('L.', 90), ('E.', 90), ('Pfeil', 89), ('Gampel', 76), ('Jo-pei', 76)]`
2 of 8: Compared 'Singleton' --> `[('Eto', 90), ('Ng', 90), ('Le', 90), ('to', 90), ('On', 90)]`
3 of 8: Compared 'Tagoe' --> `[('Go', 90), ('A', 90), ('T', 90), ('E.', 90), ('Sagoe', 80)]`
4 of 8: Compared 'Jgoun' --> `[('Go', 90), ('Gon', 75), ('Journo', 73), ('Jaguin', 73), ('Gounaris', 72)]`
5 of 8: Compared 'Ben' --> `[('Benfer', 90), ('Bence', 90), ('Ben-Amotz', 90), ('Beniaminov', 90), ('Benczak', 90)]`
6 of 8: Compared 'Porte' --> `[('Porter', 91), ('Portet', 91), ('Porten', 91), ('Po', 90), ('Gould-Porter', 90)]`
7 of 8: Compared 'Nyla' --> `[('L.', 90), ('A', 90), ('Sirichanya', 76), ('Neyland', 73), ('Greenleaf', 67)]`

And here is the python-code of the benchmark:

import os
import zipfile
from urllib import request as urlrequest
from fuzzywuzzy import process as fzproc
import time
import random

download_url = 'http://www.outpost9.com/files/wordlists/actor-surname.zip'
zip_name = os.path.basename(download_url)
fname, _ = os.path.splitext(zip_name)

def fuzzy_match(dictionary, search):
    nsearch = len(search)
    for i, s in enumerate(search):
        best = fzproc.extractBests(s, dictionary)  # top-5 matches by default
        print("%i of %i: Compared '%s' --> `%s`" % (i, nsearch, s, best))

def benchmark_fuzzy_match(wordslist, dict_split_ratio=0.9996):
    """ Shuffle and split words-list into `dictionary` and `search-words`. """
    rnd = random.Random(0)
    rnd.shuffle(wordslist)
    nwords = len(wordslist)
    ndictionary = int(dict_split_ratio * nwords)

    dictionary = wordslist[:ndictionary]
    search = wordslist[ndictionary:]
    fuzzy_match(dictionary, search)

    return ndictionary, (nwords - ndictionary)

def run_benchmark():
    if not os.path.exists(zip_name):
        urlrequest.urlretrieve(download_url, filename=zip_name)

    with zipfile.ZipFile(zip_name, 'r') as zfile:
        with zfile.open(fname) as words_file:
            blines = words_file.readlines()
            wordslist = [line.decode('ascii').strip() for line in blines]
            wordslist = wordslist[4:]  # Skip header.

            t_start = time.time()
            ndict, nsearch = benchmark_fuzzy_match(wordslist)
            t_finish = time.time()

            t_elapsed = t_finish - t_start
            ncomparisons = ndict * nsearch
            sec_per_search = t_elapsed / ncomparisons  # average time per word-pair comparison
            msg = "Compared %s words in %s sec (%s sec/search)."
            print(msg % (ncomparisons, t_elapsed, sec_per_search))

if __name__ == '__main__':
    run_benchmark()

Upvotes: 3

jimf

Reputation: 5137

Python's Natural Language Toolkit (nltk) might have some additional resources you could try out; this Google Groups thread seems like a good starting point. Just an idea.
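
For instance (a sketch of my own, assuming nltk is installed; the thread may point to other tools), nltk ships a Levenshtein edit_distance metric that can rank candidates directly:

from nltk.metrics.distance import edit_distance

def closest(query, candidates, max_dist=2):
    """Return candidates within max_dist edits, nearest first."""
    scored = [(edit_distance(query, c), c) for c in candidates]
    return [c for d, c in sorted(scored) if d <= max_dist]

print(closest('Peil', ['Pfeil', 'Porter', 'Peel', 'Bence']))
# -> ['Peel', 'Pfeil'] (both one edit away)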

Upvotes: 0
