Reputation: 45
Specifically when removing the stop letters from this getwords function.
def getwords(fileName):
    file = open(fileName, 'r')
    text = file.read()
    stopletters = [".", ",", ";", ":", "'s", '"', "!", "?", "(", ")", '“', '”']
    text = text.lower()
    for letter in stopletters:
        text = text.replace(letter, "")
    words = text.split()
    return words
And for the loop in this bigrams function:
def compute_bigrams(fileName):
    input_list = getwords(fileName)
    bigram_list = {}
    for i in range(len(input_list) - 1):
        if input_list[i] in bigram_list:
            bigram_list[input_list[i]] = bigram_list[input_list[i]] + [input_list[i + 1]]
        else:
            bigram_list[input_list[i]] = [input_list[i + 1]]
    return bigram_list
Upvotes: 1
Views: 73
Reputation: 795
You could rewrite it in this way:
def getwords(file_name):
    with open(file_name, 'r') as file:
        text = file.read().lower()
    # "'s" is two characters long, so a per-character filter cannot catch it; remove it first
    text = text.replace("'s", "")
    stop_letters = (".", ",", ";", ":", '"', "!", "?", "(", ")", '“', '”')
    text = ''.join([letter for letter in text if letter not in stop_letters])
    words = text.split()
    return words
I used a context manager to open the file, merged some lines (no need for a separate line for .lower()), and used a list comprehension to go through the text and keep each letter only if it is not in stop_letters. Because the comprehension looks at one character at a time, the two-character "'s" is removed with a separate replace call first. After joining that list you get the same result for typical input.
Note that you can pass a generator expression as well, which saves the brackets (str.join still has to collect the items internally, so expect little or no difference in speed):
    text = ''.join(letter for letter in text if letter not in stop_letters)
And if you really want to save one more line, you could return the result directly:
    return text.split()
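To check that the two approaches agree, here is a minimal sketch that works on an in-memory string instead of a file. The names STOP_LETTERS, words_with_loop, words_with_join and the sample sentence are made up for illustration. It assumes "'s" is handled with its own replace call, since a per-character filter cannot match a two-character string.

```python
# Single-character stop letters; "'s" is handled separately because it is two characters.
STOP_LETTERS = (".", ",", ";", ":", '"', "!", "?", "(", ")", "“", "”")


def words_with_loop(text):
    """The question's approach: one str.replace call per stop letter."""
    text = text.lower()
    for letter in STOP_LETTERS + ("'s",):
        text = text.replace(letter, "")
    return text.split()


def words_with_join(text):
    """The approach from this answer: filter characters, then join."""
    text = text.lower().replace("'s", "")
    return "".join(ch for ch in text if ch not in STOP_LETTERS).split()


sample = 'Hello, World! (It\'s "Python\'s" test.)'
print(words_with_loop(sample))
print(words_with_join(sample))
```

Both calls should print the same list of words for this sample. Unusual input where removing one stop letter creates a new "'s" sequence could still make the two versions differ.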
Upvotes: 2
Reputation: 2720
You can do the first replacement without a for loop at all by using a small regex:
import re
# Match any single punctuation character, or the two-character sequence 's
pattern = re.compile(r'''[.,;:"!?()“”]|'s''')
pattern.sub('', 'this is a test string (it proves that the replacements work!).')
# 'this is a test string it proves that the replacements work'
Though it is theoretically possible to turn your second loop into a comprehension, I strongly recommend against it. People, including yourself in a few months' time, will have severe problems understanding what it does. As @Alexander Cécile noted in the comments, you can refactor the second loop to iterate with for input in input_list, which improves both the performance and the readability of your code.
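One possible way to iterate over the words directly is sketched below. This is an assumption about what such a refactor could look like, not the commenter's exact suggestion: it pairs each word with its successor using zip and collects followers in a collections.defaultdict. The name bigrams_from_words is hypothetical, and the function takes a list of words rather than a file name so it can be run on its own.

```python
from collections import defaultdict


def bigrams_from_words(words):
    """Map each word to the list of words that directly follow it."""
    followers = defaultdict(list)
    # zip pairs every word with its successor, so no index arithmetic is needed.
    for current, following in zip(words, words[1:]):
        # Append in place instead of building a new list on every iteration.
        followers[current].append(following)
    return dict(followers)


print(bigrams_from_words("the cat saw the dog".split()))
# {'the': ['cat', 'dog'], 'cat': ['saw'], 'saw': ['the']}
```

Appending in place avoids copying the existing list each time a word repeats, which is what the `bigram_list[...] + [...]` expression in the question does.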
Upvotes: 2