FlyingPickle

Reputation: 1133

Stopword removal in a Python list

I have a list of sentences as follows:

pylist=['This is an apple', 'This is an orange', 'The pineapple is yellow','A grape is red']

If I define a stopwords list such as

stopwords=['This', 'is', 'an', 'The']

Is there a way for me to apply this to the entire list such that my output is

pylist=['apple','orange','pineapple is yellow','A grape is red']

PS: I tried applying a function that removes stop words, as in [removewords(x) for x in pylist], but it didn't work (and I'm not sure this is the most efficient way). Thanks!

Upvotes: 2

Views: 348

Answers (2)

Lydia van Dyke

Reputation: 2526

I think your expected output is not what you really want: the stop word 'is' is still included in it.

My attempt would be the following:

pylist = ['This is an apple', 'This is an orange', 'The pineapple is yellow', 'A grape is red']
stopwords = ['This', 'is', 'an', 'The']

stopwords = set(w.lower() for w in stopwords)


def remove_words(s, stopwords):
    s_split = s.split()
    s_filtered = [w for w in s_split if w.lower() not in stopwords]
    return " ".join(s_filtered)


result = [remove_words(x, stopwords) for x in pylist]

with the result being

['apple', 'orange', 'pineapple yellow', 'A grape red']

To keep the search reasonably efficient (a set look-up takes constant time on average), I stored the lower-case form of the stop words in a set. Stop-word removal should usually be case-insensitive.
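To illustrate why lower-casing both sides matters, here is a minimal sketch comparing an exact-match filter with the case-insensitive one. The helper filter_tokens and the sample sentence are made up for this illustration and are not part of the question.

```python
stopwords = ['This', 'is', 'an', 'The']

# Case-sensitive: only tokens spelled exactly like a stop word are removed.
exact = set(stopwords)
# Case-insensitive: lower-case the stop words once, then lower-case each token on look-up.
folded = {w.lower() for w in stopwords}


def filter_tokens(sentence, keep):
    """Hypothetical helper: keep only the tokens for which keep(token) is true."""
    return " ".join(w for w in sentence.split() if keep(w))


sample = "this is AN apple"  # made-up input with mixed casing
case_sensitive = filter_tokens(sample, lambda w: w not in exact)
case_insensitive = filter_tokens(sample, lambda w: w.lower() not in folded)

print(case_sensitive)    # this AN apple
print(case_insensitive)  # apple
```

With the exact-match set, 'this' and 'AN' slip through because their casing differs from the entries in the list; the lower-cased set catches both.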

Side-note: It is very often helpful or even necessary to remove stop words. But please be aware of the fact that there are cases where stop word-removal is not advisable: https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52

Update: If you are really sure that you need to get rid of all possible stop words, make sure you do not miss any. Take yatu's advice and have a look at nltk, especially if next year you might have to add Spanish palabra de paradas, French mot d'arrêt and German Stopp-Wörter.

Upvotes: 2

yatu

Reputation: 88305

You could use a nested list comprehension, and define stopwords as a set to reduce the lookup complexity to O(1) on average:

pylist=['This is an apple', 'This is an orange', 'The pineapple is yellow',
        'A grape is red']
stopwords = {'This', 'is', 'an', 'The'}

[' '.join([w for w in s.split() if w not in stopwords]) for s in pylist]
# ['apple', 'orange', 'pineapple yellow', 'A grape red']

Note, however, that for a more general approach you can use the stop words from nltk's English corpus:

# One-time setup to fetch the corpus: import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop_w = set(stopwords.words('english'))

[' '.join([w for w in s.split() if w.lower() not in stop_w]) for s in pylist]
# ['apple', 'orange', 'pineapple yellow', 'grape red']

Upvotes: 1
