Reputation: 844
I am running my code on a 10-year-old potato of a computer (an i5 with 4 GB RAM) and need to do a lot of language processing with NLTK. I cannot afford a new computer yet. I wrote a simple function (as part of a bigger program). The problem is, I do not know which of the two versions below is more efficient: which requires less computing power and is quicker for processing overall?
This snippet uses more variables:
import nltk
from nltk.tokenize import PunktSentenceTokenizer  # Unsupervised machine learning tokenizer.

# This is the custom tagger I created. To use it in future projects, simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    custom_sent_tokenizer = PunktSentenceTokenizer(training_text.read())  # You need to train the tokenizer on sample data.
    tokenized = custom_sent_tokenizer.tokenize(target_text.read())  # Use the trained tokenizer to tag your target file.
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagging = nltk.pos_tag(words)
        tagged.append(tagging)
    training_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    target_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    return tagged
Or is this more efficient? I actually prefer this:
import nltk
from nltk.tokenize import PunktSentenceTokenizer  # Unsupervised machine learning tokenizer.

# This is the custom tagger I created. To use it in future projects, simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    # Use the trained tokenizer to tag your target file.
    for i in PunktSentenceTokenizer(training_text.read()).tokenize(target_text.read()): tagged.append(nltk.pos_tag(nltk.word_tokenize(i)))
    training_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    target_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    return tagged
Does anyone have any other suggestions for optimizing code?
Upvotes: 0
Views: 75
Reputation: 2006
As mentioned by Daniel, timing the functions is the best way to figure out which method is faster.
I'd recommend using an IPython console to test out the timing for each function:

%timeit custom_tagger(training_file, target_file)
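If you'd rather not depend on the IPython magic, the standard-library timeit module does the same job. A minimal sketch, assuming custom_tagger is in scope and using placeholder file names:

import timeit

# Placeholder file names; substitute your own training and target files.
elapsed = timeit.timeit(
    lambda: custom_tagger("training.txt", "target.txt"),
    number=10,
)
print(f"10 runs took {elapsed:.2f}s")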
I don't think there will be much of a speed difference between the two functions as the second is merely a refactoring of the first. Having all that text on one line won't speed up your code, and it makes it quite difficult to follow. If you're concerned about code length, I'd first clean up the way you read the files. For example:
with open(target_file) as f:
    target_text = f.read()
This is much safer, as the file is closed as soon as the with block exits, even if an exception is raised along the way. You could also improve the way you name your variables: in your code, target_text is actually a file object, even though the name makes it sound like a string.
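Putting those two points together, here is a sketch of how the whole function could look: the same behavior as your version, just with with blocks and clearer names:

import nltk
from nltk.tokenize import PunktSentenceTokenizer

def custom_tagger(training_file, target_file):
    # The with blocks close both files automatically, even on error.
    with open(training_file) as f:
        training_text = f.read()
    with open(target_file) as f:
        target_text = f.read()

    # Train the tokenizer on the sample text, then tag each sentence of the target.
    tokenizer = PunktSentenceTokenizer(training_text)
    return [nltk.pos_tag(nltk.word_tokenize(sentence))
            for sentence in tokenizer.tokenize(target_text)]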
Upvotes: 1
Reputation: 1020
It does not matter which one you choose. The bulk of the computation is likely done by the tokenizer, not by the for loop in the presented code. Moreover, the two examples do the same thing; one of them merely has fewer explicit variables, but the data still needs to be stored somewhere.
Usually, algorithmic speedups come from clever elimination of loop iterations, e.g. in sorting algorithms, speedups may come from avoiding value comparisons that cannot change the order of elements (ones that don't advance the sort). Here the number of loop iterations is the same in both cases.
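If you want to check where the time actually goes on your own data, the standard-library cProfile module will show you. A minimal sketch, with placeholder file names:

import cProfile
import pstats

# Placeholder file names; assumes custom_tagger is defined in the current module.
cProfile.run('custom_tagger("training.txt", "target.txt")', "tagger.prof")
pstats.Stats("tagger.prof").sort_stats("cumulative").print_stats(10)

If the tokenizer and tagger dominate the cumulative times, as I expect, no amount of rearranging the loop will make a noticeable difference.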
Upvotes: 1