Reputation: 15619
I'm attempting to remove all the stop words from text input. The code below removes all the stop words except those that begin a sentence.
How do I remove those words too?
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
stopwords_nltk_en = set(stopwords.words('english'))
from string import punctuation
exclude_punctuation = set(punctuation)
stoplist_combined = set.union(stopwords_nltk_en, exclude_punctuation)
def normalized_text(text):
    lemma = WordNetLemmatizer()
    stopwords_punctuations_free = ' '.join([i for i in text.lower().split() if i not in stoplist_combined])
    normalized = ' '.join(lemma.lemmatize(word) for word in stopwords_punctuations_free.split())
    return normalized
sentence = [['The birds are always in their house.'], ['In the hills the birds nest.']]
for item in sentence:
    print (normalized_text(str(item)))
OUTPUT:
the bird always house
in hill bird nest
Upvotes: 0
Views: 111
Reputation: 2280
The culprit is this line of code:
print (normalized_text(str(item)))
If you print str(item) for the first element of your sentence list, you get:
['The birds are always in their house.']
which, lowered and split, becomes:
["['the", 'birds', 'are', 'always', 'in', 'their', "house.']"]
As you can see, the first element is "['the", which does not match the stop word the.
Solution: use ''.join(item) to convert item to a str.
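A minimal sketch (standard library only) that shows the difference between the two conversions:

```python
item = ['The birds are always in their house.']

# str() keeps the list's brackets and quotes inside the string
assert str(item) == "['The birds are always in their house.']"
assert str(item).lower().split()[0] == "['the"  # never matches the stop word 'the'

# ''.join() concatenates the list's elements into a plain string
assert ''.join(item) == 'The birds are always in their house.'
assert ''.join(item).lower().split()[0] == 'the'  # now it matches the stop word
```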
Edit after comment
Inside the text string there are still some stray quote characters ('). To fix this, call normalized_text as:
for item in sentence:
    print (normalized_text(item))
Then import the regex module with import re and replace:
text.lower().split()
with:
re.split('\'| ', ''.join(text).lower())
Upvotes: 1