Brianna Drew

Reputation: 109

Why won't my program filter out stop words and punctuation as I programmed it to do? (Python & NLTK)

For a lab in my Data Science course, I had to create a program in Python using NLTK for natural language processing. We have to use a for loop to iterate over each word of Macbeth and filter out all English stop words and punctuation by appending every word that is neither a stop word nor punctuation to another list. Then we have to print the most common words and their frequencies from that filtered list. I thought my logic was correct, but the results still include punctuation and stop words (see below). What am I doing wrong here? (P.S. This is my first time using NLTK.)

Program:

# import required libraries and modules
import nltk
from nltk.corpus import gutenberg, stopwords
from nltk.probability import FreqDist

macbeth_allwords = gutenberg.words('shakespeare-macbeth.txt') # read in words from macbeth
macbeth_noStop = [] # empty list to hold words from macbeth excluding stopwords
punctuations = [".", "!", "?", ",", ";", ":", "-", "[", "]", "{", "}", "(", ")", "/", "*", "~",
"<", ">", "`", "^", "_", "|", "#", "$", "%", "+", "=", "&", "@", " "] # list of common punctuation characters

# iterate through each word in macbeth, making a new list excluding all the stopwords and punctuation characters
for word in macbeth_allwords:
    if (word not in stopwords.words('english')) or (word not in punctuations):
        macbeth_noStop.append(word)

macbeth_freq = FreqDist(macbeth_noStop) # get word frequencies from the filtered list of words from macbeth

# print the 50 most common words from the filtered list of words from macbeth
print("50 Most Common Words in Macbeth (no stopwords or punctuation):")
print("-----------------------------------------------")
print(macbeth_freq.most_common(50))

Output:

50 Most Common Words in Macbeth (no stopwords or punctuation):
-----------------------------------------------
[(',', 1962), ('.', 1235), ("'", 637), ('the', 531), (':', 477), ('and', 376), ('I', 333), ('of', 315), ('to', 311), ('?', 241), ('d', 224), ('a', 214), ('you', 184), ('in', 173), ('my', 170), ('And', 170), ('is', 166), ('that', 158), ('not', 155), ('it', 138), ('Macb', 137), ('with', 134), ('s', 131), ('his', 129), ('be', 124), ('The', 118), ('haue', 117), ('me', 111), ('your', 110), ('our', 103), ('-', 100), ('him', 90), ('for', 82), ('Enter', 80), ('That', 80), ('this', 79), ('he', 76), ('What', 74), ('To', 73), ('so', 70), ('all', 67), ('thou', 63), ('are', 63), ('will', 62), ('Macbeth', 61), ('thee', 61), ('but', 60), ('But', 60), ('on', 59), ('they', 58)]

Upvotes: 0

Views: 582

Answers (3)

justMe

Reputation: 1

I guess this would be slightly more efficient (and still readable), where tokenized is your list of words and the string module has been imported:

[word for word in tokenized if not (word in nltk.corpus.stopwords.words("english") or word in string.punctuation)]
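
The condition above is the De Morgan form of "not a stop word and not punctuation", whereas the `or` in the question is true for every token. A small check of that claim follows; the stop-word set is a tiny hand-picked stand-in for `stopwords.words("english")` (an assumption, so the snippet runs without NLTK data):

```python
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words("english")
stop_words = {"the", "and", "of"}
tokens = ["the", ",", "Macbeth", "and", "?", "dagger"]

# Condition from the question: false only if a token is in BOTH collections
with_or = [t for t in tokens if t not in stop_words or t not in string.punctuation]

# Condition from this answer
with_not_or = [t for t in tokens if not (t in stop_words or t in string.punctuation)]

# Equivalent spelling using `and`
with_and = [t for t in tokens if t not in stop_words and t not in string.punctuation]

print(with_or)      # every token survives
print(with_not_or)  # ['Macbeth', 'dagger']
print(with_and)     # ['Macbeth', 'dagger']
```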

Upvotes: 0

Subigya Upadhyay

Reputation: 266

As the earlier answer mentions, the operator is incorrect: the condition needs and, not or.

macbeth_noStop = [token for token in macbeth_allwords if token not in string.punctuation and token not in stopwords.words('english')] 

Also, you could import string and use string.punctuation instead.
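
One caveat with `string.punctuation`: it is a single string, so `token in string.punctuation` is a substring test rather than a per-character lookup. Below is a sketch of a stricter check for tokens made of several punctuation characters; `is_punctuation` is a hypothetical helper name, not part of NLTK or the standard library:

```python
import string

PUNCT_CHARS = set(string.punctuation)  # one entry per punctuation character

def is_punctuation(token):
    """Return True if the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

print("--" in string.punctuation)  # False: "--" is not a substring of string.punctuation
print(is_punctuation("--"))        # True
print("" in string.punctuation)    # True: the empty string is a substring of any string
print(is_punctuation(""))          # False
```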

Upvotes: 0

Ayush

Reputation: 1620

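Everything is right except the logical condition. You meant to use and instead of or. With or, a word is dropped only if it is both a stop word and a punctuation character, which no token is, so every word is kept.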

if word not in stopwords.words('english') and word not in punctuations:

Pedantic note: You could use a set instead of list for the punctuations, that way lookup would be faster :)
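
A minimal sketch of that suggestion combined with the `and` fix. To keep it runnable without downloading NLTK data, the token list and stop-word set are small hand-written stand-ins (assumptions) for `gutenberg.words('shakespeare-macbeth.txt')` and `stopwords.words('english')`, and `collections.Counter` stands in for `FreqDist`, which offers the same `most_common` method. The `word.lower()` call goes beyond this answer: NLTK's English stop-word list is lowercase, so capitalized tokens such as 'And' or 'The' would otherwise slip through.

```python
from collections import Counter

# Stand-ins (assumptions) for the NLTK corpus and stop-word list
macbeth_allwords = ["The", "dagger", ",", "and", "the", "dagger", "?", "And", "Macbeth"]
stop_words = {"the", "and"}          # in the real program: set(stopwords.words('english'))
punctuations = {".", ",", "?", "!"}  # a set gives constant-time membership tests

macbeth_noStop = []
for word in macbeth_allwords:
    # keep the word only if it is neither a stop word nor punctuation
    if word.lower() not in stop_words and word not in punctuations:
        macbeth_noStop.append(word)

macbeth_freq = Counter(macbeth_noStop)
print(macbeth_freq.most_common(2))  # [('dagger', 2), ('Macbeth', 1)]
```

Building the stop-word set once, before the loop, also avoids re-evaluating `stopwords.words('english')` on every iteration, as the original loop does.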

Upvotes: 2
