Emil

Reputation: 1722

How to only return actual tokens, rather than empty variables when tokenizing?

I have a function:

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc), min_len=2) if word not in stop_words] for doc in texts]

My input is a list with a tokenized sentence:

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']

Assuming stop_words contains the words 'this', 'is', 'an', 'of' and 'my', the output I would like to get is:

desired_output = ['example', 'input']

However, the actual output that I'm getting now is:

actual_output = [[], [], [], ['example'], [], [], ['input']]

How can I adjust my code to get the desired output?

Upvotes: 0

Views: 54

Answers (2)

prashantpiyush

Reputation: 197

There are two solutions to your problem:

Solution 1:

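Your remove_stopwords expects a list of documents, and it treats every element of the list you pass in as a separate document. That is why each word ends up in its own inner list. Wrap your tokenized sentence in an outer list so it is handled as one document (the result is then a list containing a single filtered list):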

input = [['This', 'is', 'an', 'example', 'of', 'my', 'input']]

Solution 2:

Change your remove_stopwords function to work on a single document, which returns one flat list:

def remove_stopwords(text):
    return [word for word in simple_preprocess(str(text), min_len=2) if word not in stop_words]
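
To see why the original version returns a list of mostly empty lists, here is a minimal sketch that runs without gensim. Note that simple_tokenize is a hypothetical stand-in written for this illustration; it only approximates what simple_preprocess does (lowercasing and keeping alphabetic tokens of at least min_len characters), and the stop_words set is assumed from the question.

```python
import re

# Assumed stop word set, taken from the question
stop_words = {'this', 'is', 'an', 'of', 'my'}

def simple_tokenize(text, min_len=2):
    # Hypothetical stand-in for gensim's simple_preprocess:
    # lowercase, then keep alphabetic runs of at least min_len characters.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_len]

def remove_stopwords_nested(texts):
    # Shape of the original function: one inner list per element of texts.
    return [[w for w in simple_tokenize(str(doc)) if w not in stop_words]
            for doc in texts]

def remove_stopwords_flat(text):
    # Shape of Solution 2: one flat list for a single document.
    return [w for w in simple_tokenize(str(text)) if w not in stop_words]

words = ['This', 'is', 'an', 'example', 'of', 'my', 'input']

# Each word is treated as its own document, so stop words become empty lists.
nested = remove_stopwords_nested(words)
# The whole list is converted to one string and tokenized as one document.
flat = remove_stopwords_flat(words)

print(nested)
print(flat)
```

The nested call reproduces the shape of the output from the question, while the flat call gives a single list of the remaining tokens.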

Upvotes: 2

Chirag

Reputation: 259

If you have no specific reason to use simple_preprocess, you can remove stopwords from an already tokenized list with the code below.

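def remove_stopwords(text):
    # Build a fresh list on every call so results do not accumulate across calls
    words_filtered = []
    for w in text:
        if w not in stop_words:
            words_filtered.append(w)
    return words_filtered

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']

# Matching is case-sensitive, so 'This' must be listed with a capital T
stop_words = ['This', 'is', 'an', 'of', 'my']

print(remove_stopwords(input))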

Output:

['example', 'input']
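
The loop above compares tokens case-sensitively, which is why 'This' has to appear capitalized in stop_words. If your stop word list is lowercase, as in the question, a variant along these lines may be more convenient. This is only a sketch: passing stop_words as a second parameter and lowercasing both sides are choices made for this example, not something the question requires.

```python
def remove_stopwords(tokens, stop_words):
    # Lowercase the stop words once and store them in a set for fast lookups
    stop_set = {w.lower() for w in stop_words}
    # Keep each token in its original casing, but compare in lowercase
    return [t for t in tokens if t.lower() not in stop_set]

tokens = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
result = remove_stopwords(tokens, ['this', 'is', 'an', 'of', 'my'])
print(result)  # ['example', 'input']
```

Unlike simple_preprocess, this keeps the original casing of the tokens that survive and does not drop short tokens.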

Upvotes: 1
