bremmS

Reputation: 277

How can I make this code work faster? (searching a large corpus of text for long words)

In Python, I have created a text generator that acts on certain parameters, but my code is slow most of the time and performs below my expectations. I expect one sentence every 3-4 minutes, but it fails to comply if the database it works on is large -I use Project Gutenberg's 18-book corpus, and I will create my custom corpus and add further books, so performance is vital.- The algorithm and the implementation are below:

ALGORITHM

1- Enter the trigger sentence -only once, at the beginning of the program-

2- Get the longest word in the trigger sentence

3- Find all the sentences of the corpus that contain the word found at step 2

4- Randomly select one of those sentences

5- Get the sentence (named sentA to resolve the ambiguity in the description) that follows the sentence picked at step 4 -so long as sentA is longer than 40 characters-

6- Go to step 2; the trigger sentence is now the sentA of step 5

IMPLEMENTATION

from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:")#get input sentence from user

previousLongestWord = ""

listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus

sentenceAppender = ""

longestWord = ""

#this function is not mine; code courtesy of Dave Kirby, found on the internet (a speed trick for removing duplicates from a list while preserving order)
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]


def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]


doappend = corpusSentences.append

def appending():

    for mysentence in listOfSents: #sentences are organized into array so they can actually be read word by word.
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)


appending()
sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):


    for sentence in corpusSentences:
        if sentence.count(longestWord):#if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)


def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):

    while(len(corpusSentences[sentenceIndex + 1]) < 40):#in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence

    sentencesContainingLongestWord = []#all the sentences that include the longest word are to be inserted into this set

    setOfValidWords = [] #set for words in a sentence that exists in a corpus                    

    split_str = triggerSentence.split()#split the sentence into words

    setOfValidWords = [word for word in split_str if listOfWords.count(word)]

    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))

    longestWord = sortedSetOfValidWords[-1]

    findLongestWord(longestWord)

    previousLongestWord = longestWord

    getSentence(longestWord, sentencesContainingLongestWord)

    triggerSentence = choice(sentencesContainingLongestWord)

    sentenceIndex = corpusSentences.index(triggerSentence)

    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)

    triggerSentence = corpusSentences[sentenceIndex + 1]#get the sentence that is next to the previous trigger sentence

    print triggerSentence
    print "\n"

    corpusSentences.remove(triggerSentence)#remove the used sentence; note that after this the list indices no longer match the actual Gutenberg sentence numbers


print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors

The computer I run the code on is a bit old: the dual-core CPU was bought in Feb. 2006 and the 2x512 MB RAM in Sept. 2004, so I'm not sure if my implementation is bad or the hardware is the reason for the slow runtime. Any ideas on how I can rescue this from its hazardous form? Thanks in advance.

Upvotes: 3

Views: 516

Answers (2)

Yann Vernier

Reputation: 15887

I think my first advice must be: Think carefully about what your routines do, and make sure the name describes that. Currently you have things like:

  • arraySorter which neither deals with arrays nor sorts (it's an implementation of nub)
  • findLongestWord which counts things or selects words by criteria not present in the algorithm description, yet ends up doing nothing at all because longestWord is a local variable (argument, as it were)
  • getSentence which appends an arbitrary number of sentences onto a list
  • appending which sounds like it might be a state checker, but operates only through side effects
  • considerable confusion between local and global variables, for instance the global variable sentenceAppender is never used, nor is it an actor (for instance, a function) like the name suggests

For the task itself, what you really need are indices. It might be overkill to index every word - technically you should only need index entries for words that occur as the longest word of a sentence. Dictionaries are your primary tool here, and the second tool is lists. Once you have those indices, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence length restriction.
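To make the suggestion concrete, here is a minimal sketch of the index-based approach, assuming sentences come in as lists of words (the shape `nltk`'s `gutenberg.sents()` produces); the function and variable names here are my own, not from the question's code:

```python
from collections import defaultdict
from random import choice

def build_index(sentences):
    """Map each word to the indices of the sentences containing it."""
    index = defaultdict(list)
    for i, words in enumerate(sentences):
        for word in set(words):  # set() so a sentence is recorded once per word
            index[word].append(i)
    return index

def next_sentence(trigger_words, sentences, index):
    """Pick a random sentence containing the longest word of the trigger."""
    longest = max(trigger_words, key=len)
    i = choice(index[longest])
    return sentences[i]

# Toy corpus standing in for gutenberg.sents()
sentences = [["the", "quick", "brown", "fox"],
             ["a", "lazy", "dog"],
             ["another", "quick", "test"]]
index = build_index(sentences)
print(next_sentence(["some", "quick", "words"], sentences, index))
```

The index is built once, up front; after that, each step of the generator is a dictionary lookup plus a `random.choice`, instead of a full scan of every sentence in the corpus as in the original `getSentence`.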

This example should prove a good object lesson that modern hardware or optimizers like Psyco do not solve algorithmic problems.

Upvotes: 4

Kai

Reputation: 39651

Maybe Psyco speeds up the execution?

Upvotes: 1
