Joseph

Reputation: 33

Python 3 - efficient lookups with large lists

I'm making my own rhymer in Python using NLTK. The CMU dict has over 130,000 entries in a format like this:

[['griffon', ['G', 'R', 'IH1', 'F', 'AH0', 'N']],
 ['griffy', ['G', 'R', 'IH1', 'F', 'IY0']],
 ['grigas', ['G', 'R', 'AY1', 'G', 'AH0', 'Z']]]

These are word, pron(unciation) pairs. I manipulate the prons (maybe switch a 'G' for a 'T') and check whether the result is a word. I do this with the following function:

all_word_prons = get_word_pron_pairs_in_cmu_dict()

def pron_is_a_word(pronunciation):
    for word, pron in all_word_prons:
        if pronunciation == pron:
            return True
    return False

all_word_prons is loaded from a 10 MB pickle file that contains all 130k entries.

If I perform this lookup 1000 times, it takes around 23 seconds. That isn't terrible given the task, but there has to be a better algorithm. I've seen people recommend sets or bisect in other topics, but those apply to simple string lookups. This is more or less checking whether a list is equal, not a string.

I did implement a tree-like structure that stores the data like this (using the example from above):

{'G': {'R': {'IH1': {'F': {'IY0': {0: 'griffy'}, 'AH0': {'N': {0: 'griffon'}}}}, 'AY1': {'G': {'AH0': {'Z': {0: 'grigas'}}}}}}}

For some reason this takes even longer than simply iterating through the list. Perhaps my implementation is wrong. If you're curious:

def build_fast_idict_tree():
    from nltk.corpus import cmudict
    entries = cmudict.entries()
    idict = {}
    for entry in entries:
        word, pronunciation = entry
        idict_level = idict
        # walk (and create as needed) one nested dict per phoneme
        for syl in pronunciation:
            if syl not in idict_level:
                idict_level[syl] = {}
            idict_level = idict_level[syl]
        # the 0 key marks the end of a complete word
        idict_level[0] = word
    return idict

def get_fast_idict_tree():
    import os
    import pickle
    filename = "fast_idict_tree.pickle"
    if os.path.isfile(filename):
        # reuse the cached tree instead of rebuilding it
        with open(filename, "rb") as f:
            tree = pickle.load(f)
    else:
        tree = build_fast_idict_tree()
        with open(filename, "wb") as f:
            pickle.dump(tree, f)
    return tree

def lookup_in_fast_idict_tree(syls):
    # note: this reloads the pickled tree on every call
    idict = get_fast_idict_tree()
    for syl in syls:
        if syl not in idict:
            return False
        idict = idict[syl]
    # the 0 key marks the end of a complete word
    return idict[0] if 0 in idict else False
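
For reference, a lookup would look like this (using the 'griffy' entry from above):

print(lookup_in_fast_idict_tree(['G', 'R', 'IH1', 'F', 'IY0']))  # -> 'griffy'
print(lookup_in_fast_idict_tree(['G', 'R', 'IH1', 'F']))         # -> False (incomplete pronunciation)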

TL;DR

What is the fastest way to do this sort of lookup (matching a list) in Python 3 in the year 2015?

Upvotes: 0

Views: 187

Answers (2)

bbayles

Reputation: 4527

If I understand correctly, you want to check whether some pronunciation is in your data set. From your first code block, it doesn't seem like you care which word the match came from.

Therefore, I think we could do:

pron_set = {tuple(pron) for word, pron in all_word_prons}
# do something to get a list of pronunciations to check
for pronunciation in pronunciation_list:
    if tuple(pronunciation) in pron_set:
        print(pronunciation)

We construct pron_set from tuples because lists are not hashable (and can't be used as set members).
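
If you do need the matching word back, the same idea works with a dict keyed by pronunciation tuples. A minimal sketch (pron_to_word and word_for_pron are illustrative names, not from the question):

pron_to_word = {tuple(pron): word for word, pron in all_word_prons}

def word_for_pron(pronunciation):
    # returns the word, or None if the pronunciation isn't known
    return pron_to_word.get(tuple(pronunciation))

Note that this keeps only one word per pronunciation; if you care about homophones, map each tuple to a list of words instead.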

Set lookup should be much faster than iterating through the list: a set membership test is O(1) on average, instead of O(n) for a linear scan. I would recommend getting familiar with the Python data structures; you never know when a deque might save you lots of time.

Upvotes: 2

PaulKiripaul

Reputation: 47

Have you considered using Python List Comprehensions as outlined here?

https://docs.python.org/3/tutorial/datastructures.html

In certain cases list comprehensions can be faster than plain for loops; however, they still execute a byte-code-level loop. If you're not sure what I mean, check out this thread: Are list-comprehensions and functional functions faster than "for loops"?

It may be worth a shot to see if this would be faster.
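
For instance, a minimal sketch of this idea, assuming all_word_prons from the question (any() over a generator expression, a close cousin of a list comprehension, short-circuits on the first match):

def pron_is_a_word(pronunciation):
    # stops scanning as soon as a matching pronunciation is found
    return any(pron == pronunciation for word, pron in all_word_prons)

This still scans the whole list in the worst case, so it mainly saves the interpreter overhead of an explicit loop body.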

Upvotes: 0
