Reputation: 1722
I have a function:
from gensim.utils import simple_preprocess

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc), min_len=2) if word not in stop_words] for doc in texts]
My input is a list with a tokenized sentence:
input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
Assume that stop_words contains the words 'this', 'is', 'an', 'of' and 'my'. The output I would like to get is:
desired_output = ['example', 'input']
However, the actual output that I'm getting now is:
actual_output = [[], [], [], ['example'], [], [], ['input']]
How can I adjust my code to get the desired output?
Upvotes: 0
Views: 54
Reputation: 197
There are two solutions to your problem:
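Your remove_stopwords expects a list of documents, where each document is itself a list of tokens. You can wrap your input in an outer list like this (the result will then also be wrapped in an outer list):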
input = [['This', 'is', 'an', 'example', 'of', 'my', 'input']]
Alternatively, change your remove_stopwords function to work on a single document:
def remove_stopwords(text):
    return [word for word in simple_preprocess(str(text), min_len=2) if word not in stop_words]
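To see why the original function returned one list per word, here is a minimal sketch that runs without gensim. Note that `tokenize` is a hypothetical stand-in for `simple_preprocess` (it lowercases and keeps alphabetic tokens of at least `min_len` characters) and is not gensim's actual implementation. The function names `remove_stopwords_per_doc` and `remove_stopwords_single` are also made up for illustration.

```python
import re

stop_words = {'this', 'is', 'an', 'of', 'my'}

def tokenize(doc, min_len=2):
    # Hypothetical stand-in for gensim's simple_preprocess:
    # lowercase the text and keep alphabetic tokens of at least min_len characters.
    return [t for t in re.findall(r"[a-z]+", doc.lower()) if len(t) >= min_len]

def remove_stopwords_per_doc(texts):
    # Shape of the original function: every element of texts is its own document.
    return [[w for w in tokenize(str(doc)) if w not in stop_words] for doc in texts]

def remove_stopwords_single(text):
    # Single-document shape: the whole token list is treated as one document.
    return [w for w in tokenize(str(text)) if w not in stop_words]

tokens = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
nested = remove_stopwords_per_doc(tokens)  # one (mostly empty) list per word
flat = remove_stopwords_single(tokens)     # one flat list
print(nested)
print(flat)
```

Passing a flat list of words to the per-document version makes each word a one-word document, so every stop word yields an empty list. The single-document version filters the whole list at once.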
Upvotes: 2
Reputation: 259
If there is no specific reason to keep your current code, you can use the code below to remove stopwords.
def remove_stopwords(text):
    # Build a fresh list on every call so results do not accumulate across calls
    words_filtered = []
    for w in text:
        if w not in stop_words:
            words_filtered.append(w)
    return words_filtered

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
stop_words = ['This', 'is', 'an', 'of', 'my']
print(remove_stopwords(input))
Output:
['example', 'input']
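The function above compares words case-sensitively, which is why its stop_words lists 'This' with a capital T. If the stop words are all lowercase, as in the question, one possible variant is to compare lowercased tokens. This is only a sketch: the name `remove_stopwords_ci` and the explicit `stop_words` parameter are assumptions for illustration, not part of the original code.

```python
def remove_stopwords_ci(tokens, stop_words):
    # Lowercase the stop words once and store them in a set for fast lookups.
    stop_set = {s.lower() for s in stop_words}
    # Keep each token in its original casing, but compare it in lowercase.
    return [t for t in tokens if t.lower() not in stop_set]

result = remove_stopwords_ci(
    ['This', 'is', 'an', 'example', 'of', 'my', 'input'],
    ['this', 'is', 'an', 'of', 'my'],
)
print(result)
```

With the question's input and lowercase stop words, this keeps only 'example' and 'input'.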
Upvotes: 1