How to only return actual tokens, rather than empty variables when tokenizing?

Question

I have a function:

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc), min_len = 2) if word not in stop_words] for doc in texts]

My input is a list with a tokenized sentence:

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']

Assuming stop_words contains the words 'this', 'is', 'an', 'of' and 'my', the output I would like to get is:

desired_output = ['example', 'input']

However, the actual output that I'm getting now is:

actual_output = [[], [], [], ['example'], [], [], ['input']]

How can I adjust my code to get the desired output?

Chirag · Accepted Answer

You can use the code below to remove stop words, unless you have a specific reason to keep your original approach.

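def remove_stopwords(text):
    # Build the result inside the function so repeated calls do not accumulate tokens
    wordsFiltered = []
    for w in text:
        if w not in stop_words:
            wordsFiltered.append(w)
    return wordsFiltered

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']

# The comparison is case-sensitive, so 'This' is listed with a capital T
stop_words = ['This', 'is', 'an', 'of', 'my']

print(remove_stopwords(input))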

Output:

['example', 'input']
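
The nested empty lists in the question appear because the outer comprehension treats every token as its own document: each token is preprocessed separately, and a stop word leaves an empty inner list. Below is a minimal sketch of a flat version, assuming the input is already tokenized and the stop words are lower-case as listed in the question. The name MIN_LEN is a hypothetical constant mirroring min_len = 2, and a plain lower() stands in for gensim's simple_preprocess so the example runs without extra dependencies.

```python
# Assumption: stop words are stored in lower case, as listed in the question.
stop_words = {'this', 'is', 'an', 'of', 'my'}
MIN_LEN = 2  # hypothetical constant mirroring min_len = 2 in the question


def remove_stopwords(tokens):
    """Return one flat list of lower-cased tokens that are not stop words."""
    # Lower-case each token so 'This' matches the stop word 'this'.
    lowered = (str(token).lower() for token in tokens)
    # A single comprehension yields a flat list instead of one list per token.
    return [word for word in lowered
            if len(word) >= MIN_LEN and word not in stop_words]


tokens = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
result = remove_stopwords(tokens)
print(result)
```

Using a set for stop_words keeps each membership test cheap, and lower-casing first means the stop-word list does not need one entry per capitalization.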
