Reputation: 844
I am running my code on a 10-year-old potato of a computer (an i5 with 4 GB RAM) and need to do a lot of language processing with NLTK. I cannot afford a new computer yet. I wrote a simple function (as part of a bigger program). The problem is, I do not know which of the two versions below is more efficient: which requires less computing power and is quicker for processing overall?
This snippet uses more variables:
import nltk
from nltk.tokenize import PunktSentenceTokenizer  # Unsupervised machine learning tokenizer.

# This is the custom tagger I created. To use it in future projects, simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    custom_sent_tokenizer = PunktSentenceTokenizer(training_text.read())  # You need to train the tokenizer on sample data.
    tokenized = custom_sent_tokenizer.tokenize(target_text.read())  # Use the trained tokenizer to tag your target file.
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagging = nltk.pos_tag(words)
        tagged.append(tagging)
    training_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    target_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    return tagged
Or is this more efficient? I actually prefer this:
import nltk
from nltk.tokenize import PunktSentenceTokenizer  # Unsupervised machine learning tokenizer.

# This is the custom tagger I created. To use it in future projects, simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    # Use the trained tokenizer to tag your target file.
    for i in PunktSentenceTokenizer(training_text.read()).tokenize(target_text.read()): tagged.append(nltk.pos_tag(nltk.word_tokenize(i)))
    training_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    target_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    return tagged
Does anyone have any other suggestions for optimizing code?
Upvotes: 0
Views: 75
Reputation: 2006
As mentioned by Daniel, timing the functions is the best way to figure out which method is faster.
I'd recommend using an IPython console to test out the timing for each function:

%timeit custom_tagger(training_file, target_file)
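If you'd rather not depend on the IPython magic, the standard-library timeit module does the same job. A minimal sketch, assuming custom_tagger is in scope and using placeholder file names:

import timeit

# Placeholder file names; substitute your own training and target files.
elapsed = timeit.timeit(
    lambda: custom_tagger("training.txt", "target.txt"),
    number=10,
)
print(f"10 runs took {elapsed:.2f}s")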
I don't think there will be much of a speed difference between the two functions as the second is merely a refactoring of the first. Having all that text on one line won't speed up your code, and it makes it quite difficult to follow. If you're concerned about code length, I'd first clean up the way you read the files. For example:
with open(target_file) as f:
    target_text = f.read()
This is much safer, as the file is closed as soon as the with block exits, even if an exception is raised along the way. You could also improve the way you name your variables: in your code, target_text is actually a file object, even though the name makes it sound like a string.
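Putting those two points together, here is a sketch of how the whole function could look: the same behavior as your version, just with with blocks and clearer names:

import nltk
from nltk.tokenize import PunktSentenceTokenizer

def custom_tagger(training_file, target_file):
    # The with blocks close both files automatically, even on error.
    with open(training_file) as f:
        training_text = f.read()
    with open(target_file) as f:
        target_text = f.read()

    # Train the tokenizer on the sample text, then tag each sentence of the target.
    tokenizer = PunktSentenceTokenizer(training_text)
    return [nltk.pos_tag(nltk.word_tokenize(sentence))
            for sentence in tokenizer.tokenize(target_text)]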
Upvotes: 1
Reputation: 1020
It does not matter which one you choose. The bulk of the computation is likely done by the tokenizer, not by the for loop in the presented code. Moreover, the two examples do the same thing; one of them merely has fewer explicit variables, but the data still needs to be stored somewhere.
Usually, algorithmic speedups come from clever elimination of loop iterations, e.g. in sorting algorithms, speedups may come from avoiding value comparisons that cannot change the order of elements (ones that don't advance the sort). Here the number of loop iterations is the same in both cases.
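If you want to check where the time actually goes on your own data, the standard-library cProfile module will show you. A minimal sketch, with placeholder file names:

import cProfile
import pstats

# Placeholder file names; assumes custom_tagger is defined in the current module.
cProfile.run('custom_tagger("training.txt", "target.txt")', "tagger.prof")
pstats.Stats("tagger.prof").sort_stats("cumulative").print_stats(10)

If the tokenizer and tagger dominate the cumulative times, as I expect, no amount of rearranging the loop will make a noticeable difference.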
Upvotes: 1