Struggling with removing stop words using nltk

Question

I'm trying to remove the stop words from "I don't like ice cream." I have defined:

stop_words = set(nltk.corpus.stopwords.words('english'))

and the function

def stop_word_remover(text):
    return [word for word in text if word.lower() not in stop_words]

But when I apply the function to the string in question, I get this list:

[' ', 'n', '’', ' ', 'l', 'k', 'e', ' ', 'c', 'e', ' ', 'c', 'r', 'e', '.']

Joining these strings with ''.join(stop_word_remover("I don’t like ice cream.")) gives

' n’ lke ce cre.'

which is not what I was expecting.

Any tips on where I have gone wrong?

Masoud Gheisari · Accepted Answer

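word for word in text iterates over the characters of text, not over its words, because iterating over a string yields one character at a time. Tokenize the text into words first and then filter the tokens, as below: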

import nltk
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

stop_words = set(nltk.corpus.stopwords.words('english'))

def stop_word_remover(text):
    word_tokens = word_tokenize(text)
    word_list = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(word_list)

stop_word_remover("I don't like ice cream.")

# Returns "n't like ice cream ."
# word_tokenize splits "don't" into "do" and "n't"; "do" is a stop word, "n't" is not.
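
To see why the original function returned single characters, here is a minimal sketch that needs no NLTK downloads. STOP_WORDS is a small hand-written set assumed for illustration only (NLTK's English list is much longer but does contain these entries), and remove_by_character and remove_by_word are hypothetical helper names. It uses str.split() rather than word_tokenize, so punctuation stays attached to the words.

```python
# Hypothetical mini stop list for illustration; not NLTK's full English list.
STOP_WORDS = {"i", "a", "d", "m", "o", "t", "do", "don't"}

def remove_by_character(text):
    # Iterating over a str yields one-character strings, so only
    # single-letter stop words ("i", "a", "m", ...) can ever match.
    return [ch for ch in text if ch.lower() not in STOP_WORDS]

def remove_by_word(text):
    # str.split() yields whitespace-separated words, so whole words are compared.
    return [word for word in text.split() if word.lower() not in STOP_WORDS]

sentence = "I don't like ice cream."
print("".join(remove_by_character(sentence)))  # letters i, d, o, t, a, m are stripped
print(" ".join(remove_by_word(sentence)))      # whole stop words are stripped
```

The first call reproduces the mangled result from the question, because the single letters in the stop list match individual characters. The second call compares whole words, which is what the tokenizer-based version above does as well.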
