klabanus

Reputation: 75

Speed up n-gram processing

I have a list, uniqueWordList, with a lot of words (100,000+). The trigrams of every one of those words are stored in the set allTriGrams.

I want to build a dictionary that has every unique trigram as a key and, as its value, the list of all words that contain that trigram.

Example:

epicDict = {'ban': ['banana', 'banned'], 'nan': ['banana']}

My code so far:

epicDict = {}
for value in allTriGrams:
    for word in uniqueWordList:
        # substring test, executed once per (trigram, word) pair
        if value in word:
            epicDict.setdefault(value, []).append(word)

My problem: This method takes a LOT of time. Is there any way to speed up this process?

Upvotes: 3

Views: 159

Answers (2)

Julian Go

Reputation: 4492

Among simple solutions, I expect this to be faster:

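import collections

epicDict = collections.defaultdict(set)
for word in uniqueWordList:
    # Slice out every 3-character window of the word
    for trigram in (word[x:x+3] for x in range(len(word)-2)):
        # A set keeps each word only once per trigram
        epicDict[trigram].add(word)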
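
As a quick sanity check, here is a minimal sketch of the same idea wrapped in a function and run on the two words from the question's example. The function name build_trigram_index and the sample list are made up for illustration. Each word is visited once and only its own trigrams are generated, so the work grows with the total number of characters rather than with (number of trigrams) x (number of words).

```python
from collections import defaultdict


def build_trigram_index(words):
    """Map each trigram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        # range() is empty for words shorter than 3 characters,
        # so such words simply contribute no trigrams.
        for i in range(len(word) - 2):
            index[word[i:i + 3]].add(word)
    return index


sample_words = ["banana", "banned"]  # hypothetical sample data
index = build_trigram_index(sample_words)

print(sorted(index["ban"]))  # ['banana', 'banned']
print(sorted(index["nan"]))  # ['banana']
```

The values are sets here, not lists as in the question. If lists are required, something like {k: sorted(v) for k, v in index.items()} would convert them afterwards.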

Upvotes: 0

idjaw

Reputation: 26578

What if uniqueWordList were a set instead? Then you could do this:

if value in uniqueWordList:
    epicDict.setdefault(value,[]).append(word)

Check this out: Python Sets vs Lists
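
One caveat worth checking before converting: as far as I can tell, `in` on a set tests whether the exact element is present, whereas `in` on a string tests for a substring. A minimal sketch with hypothetical sample data shows the difference:

```python
words_list = ["banana", "banned"]  # hypothetical sample data
words_set = set(words_list)

# `in` on a set asks "is this exact element present?"
print("ban" in words_set)     # False: "ban" is not itself a word in the set
print("banana" in words_set)  # True

# `in` on a string asks "is this a substring?"
print("ban" in "banana")      # True

# Substring matching against a set therefore still scans every word.
matches = sorted(w for w in words_set if "ban" in w)
print(matches)                # ['banana', 'banned']
```

So a set speeds up whole-word lookups, but finding every word that contains a given trigram still means looking at each word.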

Upvotes: 2
