Reputation: 277
In Python, I have created a text generator that acts on certain parameters, but my code is slow most of the time and performs below my expectations. I expect one sentence every 3-4 minutes, but it fails to comply when the corpus it works on is large. I currently use Project Gutenberg's 18-book corpus, and I will create a custom corpus and add further books, so performance is vital. The algorithm and the implementation are below:
ALGORITHM
1- Enter the trigger sentence (only once, at the beginning of the program)
2- Get the longest word in the trigger sentence
3- Find all the sentences of the corpus that contain the word from step 2
4- Randomly select one of those sentences
5- Get the sentence (named sentA to resolve the ambiguity in this description) that follows the sentence picked at step 4, so long as sentA is longer than 40 characters
6- Go to step 2; the trigger sentence is now the sentA of step 5
IMPLEMENTATION
from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence: ") #get input sentence from user

previousLongestWord = ""

listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus
sentenceAppender = ""
longestWord = ""

#this function is not mine, code courtesy of Dave Kirby, found on the internet, about removing duplicates from a list without losing order
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]


def findLongestWord(longestWord):
    if (listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if (listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]


doappend = corpusSentences.append

def appending():
    for mysentence in listOfSents: #sentences are organized into arrays so they can actually be read word by word
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)


appending()

sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):
    for sentence in corpusSentences:
        if sentence.count(longestWord): #if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)


def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
    while (len(corpusSentences[sentenceIndex + 1]) < 40): #in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)


while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence
    sentencesContainingLongestWord = [] #all the sentences that include the longest word are to be inserted into this list
    setOfValidWords = [] #words in the sentence that exist in the corpus
    split_str = triggerSentence.split() #split the sentence into words
    setOfValidWords = [word for word in split_str if listOfWords.count(word)]
    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key=len))
    longestWord = sortedSetOfValidWords[-1]
    findLongestWord(longestWord)
    previousLongestWord = longestWord
    getSentence(longestWord, sentencesContainingLongestWord)
    triggerSentence = choice(sentencesContainingLongestWord)
    sentenceIndex = corpusSentences.index(triggerSentence)
    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)
    triggerSentence = corpusSentences[sentenceIndex + 1] #get the sentence that follows the previous trigger sentence
    print triggerSentence
    print "\n"
    corpusSentences.remove(triggerSentence) #you can remove this line so the index numbers stay consistent with the actual Gutenberg sentence numbers

print "End of session, please rerun the program"
#reached once the while loop exits, so that the program ends without errors
The computer I run the code on is a bit old: the dual-core CPU was bought in Feb. 2006 and the 2x512 RAM in Sept. 2004, so I'm not sure whether my implementation is bad or the hardware is the reason for the slow runtime. Any ideas on how I can rescue this from its hazardous form? Thanks in advance.
Upvotes: 3
Views: 516
Reputation: 15887
I think my first advice must be: Think carefully about what your routines do, and make sure the name describes that. Currently you have things like:
- arraySorter, which neither deals with arrays nor sorts (it's an implementation of nub)
- findLongestWord, which counts things or selects words by criteria not present in the algorithm description, yet ends up doing nothing at all because longestWord is a local variable (an argument, as it were)
- getSentence, which appends an arbitrary number of sentences onto a list
- appending, which sounds like it might be a state checker, but operates only through side effects
- sentenceAppender, which is never used, nor is it an actor (for instance, a function) like the name suggests

For the task itself, what you really need are indices. It might be overkill to index every word; technically you should only need index entries for words that occur as the longest word of a sentence. Dictionaries are your primary tool here, and the second tool is lists. Once you have those indices, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence length restriction.
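A rough sketch of that kind of index (the names such as next_sentence and the way the 40-character rule is handled here are mine, purely illustrative, not taken from your code):

from collections import defaultdict
from random import choice
from nltk.corpus import gutenberg

# Join each tokenised sentence into a plain string once, up front.
sentence_tokens = gutenberg.sents()
sentences = [" ".join(s) for s in sentence_tokens]

# Index: word (lower-cased) -> list of indices of sentences containing it.
index = defaultdict(list)
for i, tokens in enumerate(sentence_tokens):
    for word in set(w.lower() for w in tokens):
        index[word].append(i)

def next_sentence(trigger):
    # Longest word of the trigger that actually occurs in the corpus.
    candidates = sorted((w for w in trigger.split() if w.lower() in index), key=len)
    if not candidates:
        return None
    longest = candidates[-1].lower()
    # One dictionary lookup, one random.choice, one list lookup.
    i = choice(index[longest])
    if i + 1 < len(sentences) and len(sentences[i + 1]) > 40:
        return sentences[i + 1]
    return None

Building the index is a one-time cost; after that, each generated sentence avoids scanning the whole corpus the way listOfWords.count and corpusSentences.index do on every iteration.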
This example should prove a good object lesson that modern hardware or optimizers like Psyco do not solve algorithmic problems.
Upvotes: 4