Reputation: 33
I'm making my own rhymer in Python using NLTK. The CMU dict has over 130,000 entries in a format like this:
[['griffon', ['G', 'R', 'IH1', 'F', 'AH0', 'N']],
['griffy', ['G', 'R', 'IH1', 'F', 'IY0']],
['grigas', ['G', 'R', 'AY1', 'G', 'AH0', 'Z']]]
These are word, pron(unciation) pairs. I manipulate the prons (maybe switch a 'G' for a 'T') and check whether the result is a word, using this function:
all_word_prons = get_word_pron_pairs_in_cmu_dict()

def pron_is_a_word(pronunciation):
    for word, pron in all_word_prons:
        if pronunciation == pron:
            return True
    return False
all_word_prons is loaded from a 10 MB Python pickle file that contains all 130k entries.
If I perform this lookup 1000 times, it takes around 23 seconds, which isn't completely bad considering the task, but there has to be a better algorithm. I've seen people recommend sets or bisect in other threads, but those apply to simple string lookups. This is more or less checking whether a list is equal, not a string.
I did implement a tree-like structure that contains data like this (using the example from above):
{'G': {'R': {'IH1': {'F': {'IY0': {0: 'griffy'}, 'AH0': {'N': {0: 'griffon'}}}}, 'AY1': {'G': {'AH0': {'Z': {0: 'grigas'}}}}}}}
For some reason this takes even longer than the simple iteration. Perhaps my implementation is wrong. If you're curious:
def build_fast_idict_tree():
    from nltk.corpus import cmudict
    entries = cmudict.entries()
    idict = {}
    for entry in entries:
        word, pronunciation = entry
        idict_level = idict
        for syl in pronunciation:
            if syl not in idict_level:
                idict_level[syl] = {}
            idict_level = idict_level[syl]
        idict_level[0] = word
    return idict
def get_fast_idict_tree():
    filename = "fast_idict_tree.pickle"
    if os.path.isfile(filename):
        with open(filename, "rb") as f:
            tree = pickle.load(f)
    else:
        tree = build_fast_idict_tree()
        with open(filename, "wb") as f:
            pickle.dump(tree, f)
    return tree
def lookup_in_fast_idict_tree(syls):
    idict = get_fast_idict_tree()  # note: this unpickles the whole tree on every call
    for syl in syls:
        if syl not in idict:
            return False
        idict = idict[syl]
    return idict[0] if 0 in idict else False
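One plausible reason the tree version is slower than the linear scan is that lookup_in_fast_idict_tree re-loads the pickle on every call. A minimal self-contained sketch (toy data standing in for the CMU dict; helper names are illustrative, not from the code above) that builds the trie once and reuses it:

```python
def build_trie(entries):
    trie = {}
    for word, pron in entries:
        level = trie
        for syl in pron:
            level = level.setdefault(syl, {})
        level[0] = word  # sentinel key 0 marks the end of a pronunciation
    return trie

# Toy data; in practice this would be nltk's cmudict.entries().
ENTRIES = [
    ('griffon', ['G', 'R', 'IH1', 'F', 'AH0', 'N']),
    ('griffy',  ['G', 'R', 'IH1', 'F', 'IY0']),
]

TRIE = build_trie(ENTRIES)  # built (or unpickled) once, not per lookup

def lookup(syls, trie=TRIE):
    for syl in syls:
        if syl not in trie:
            return False
        trie = trie[syl]
    return trie.get(0, False)

print(lookup(['G', 'R', 'IH1', 'F', 'IY0']))  # 'griffy'
print(lookup(['G', 'R', 'IH1', 'F']))         # False (prefix only, no word ends here)
```

With the one-time build hoisted out of the lookup, each query is proportional to the pronunciation's length rather than to load-the-pickle time.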
What is the fastest way to do this sort of lookup (matching a list) in Python 3 in the year 2015?
Upvotes: 0
Views: 187
Reputation: 4527
If I understand correctly, you want to check to see whether some pronunciation
is in your data set. From your first code block, it doesn't seem like you care what word
the match came from.
Therefore, I think we could do:
pron_set = {tuple(pron) for word, pron in all_word_prons}

# do something to get a list of pronunciations to check
for pronunciation in pronunciation_list:
    if tuple(pronunciation) in pron_set:
        print(pronunciation)
We construct pron_set from tuples because lists are not hashable (and can't be used as set members).
Set lookup should be much faster than iterating through the list. I would recommend becoming familiar with the Python data structures; you never know when a deque might save you lots of time.
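For reference, here is the same approach as a self-contained sketch with sample data (the real all_word_prons would come from the CMU dict):

```python
# Sample word/pronunciation pairs standing in for the full 130k-entry dict.
all_word_prons = [
    ('griffon', ['G', 'R', 'IH1', 'F', 'AH0', 'N']),
    ('griffy',  ['G', 'R', 'IH1', 'F', 'IY0']),
    ('grigas',  ['G', 'R', 'AY1', 'G', 'AH0', 'Z']),
]

# Tuples are hashable, so they can live in a set; lists cannot.
pron_set = {tuple(pron) for word, pron in all_word_prons}

def pron_is_a_word(pronunciation):
    # Average O(1) membership test instead of an O(n) scan.
    return tuple(pronunciation) in pron_set

print(pron_is_a_word(['G', 'R', 'IH1', 'F', 'IY0']))  # True
print(pron_is_a_word(['G', 'R', 'IH1', 'F']))         # False
```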
Upvotes: 2
Reputation: 47
Have you considered using Python List Comprehensions as outlined here?
https://docs.python.org/3/tutorial/datastructures.html
In certain cases, list comprehensions can be faster than plain for-loops; however, they still execute a byte-code-level loop. If you're not certain what I mean, check out this thread: Are list-comprehensions and functional functions faster than "for loops"?
It may be worth a shot to see if this would be faster.
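As a quick sketch of what this might look like (toy data; still a linear scan, so any speedup comes from pushing the iteration into C, and actual gains vary):

```python
# Toy stand-in for the full word/pronunciation list.
all_word_prons = [
    ('griffy', ['G', 'R', 'IH1', 'F', 'IY0']),
    ('grigas', ['G', 'R', 'AY1', 'G', 'AH0', 'Z']),
]

def pron_is_a_word(pronunciation):
    # any() with a generator expression short-circuits on the first
    # match, like the original loop, but avoids the explicit Python loop body.
    return any(pron == pronunciation for word, pron in all_word_prons)

print(pron_is_a_word(['G', 'R', 'IH1', 'F', 'IY0']))  # True
print(pron_is_a_word(['G', 'R', 'IH1', 'F']))         # False
```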
Upvotes: 0