Reputation: 1133
I have a list of sentences as follows:
pylist=['This is an apple', 'This is an orange', 'The pineapple is yellow','A grape is red']
If I define a stopwords list such as
stopwords=['This', 'is', 'an', 'The']
Is there a way for me to apply this to the entire list such that my output is
pylist=['apple','orange','pineapple is yellow','A grape is red']
PS: I tried to use apply with a function defined to remove stopwords, like [removewords(x) for x in pylist], but wasn't successful (plus I'm not sure if this is the most efficient way).
Thanks!
Upvotes: 2
Views: 348
Reputation: 2526
I think your expected output is not what you really want: the stop word 'is' is still included in it.
My attempt would be the following:
pylist = ['This is an apple', 'This is an orange', 'The pineapple is yellow', 'A grape is red']
stopwords = ['This', 'is', 'an', 'The']
stopwords = set(w.lower() for w in stopwords)
def remove_words(s, stopwords):
    s_split = s.split()
    s_filtered = [w for w in s_split if w.lower() not in stopwords]
    return " ".join(s_filtered)
result = [remove_words(x, stopwords) for x in pylist]
with the result being
['apple', 'orange', 'pineapple yellow', 'A grape red']
To get a reasonably efficient search (a look-up in a set takes constant time on average), I stored the lower-case form of the stop words in a set. Removal of stop words should usually be case insensitive.
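To illustrate why the lower-casing matters, here is a minimal sketch that compares an exact-match set with a lower-cased set. The sample sentence 'this IS an apple' and the variable names exact, lowered, case_sensitive and case_insensitive are made up for this illustration and are not part of the question.

```python
stopwords = ['This', 'is', 'an', 'The']

# Exact-match set: only removes words spelled exactly like the stop words.
exact = set(stopwords)
# Lower-cased set: compare against w.lower() to ignore case.
lowered = {w.lower() for w in stopwords}

# Hypothetical input with mixed casing (not from the question).
sentence = 'this IS an apple'

case_sensitive = ' '.join(w for w in sentence.split() if w not in exact)
case_insensitive = ' '.join(w for w in sentence.split() if w.lower() not in lowered)

print(case_sensitive)    # this IS apple
print(case_insensitive)  # apple
```

With the exact-match set, 'this' and 'IS' survive because only 'This' and 'is' are listed; the lower-cased set removes both.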
Side note: It is very often helpful or even necessary to remove stop words, but be aware that there are cases where stop word removal is not advisable: https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52
Update: If you are really sure that you need to get rid of all possible stop words, make sure you do not miss any. Take yatu's advice and have a look at nltk, especially if next year you might be faced with having to add Spanish palabras de parada, French mots d'arrêt and German Stopp-Wörter.
Upvotes: 2
Reputation: 88305
You could use a nested list comprehension, and define stopwords as a set to reduce the lookup complexity to O(1):
pylist = ['This is an apple', 'This is an orange', 'The pineapple is yellow',
          'A grape is red']
stopwords = {'This', 'is', 'an', 'The'}
[' '.join([w for w in s.split() if w not in stopwords]) for s in pylist]
# ['apple', 'orange', 'pineapple yellow', 'A grape red']
Note, however, that for a more general approach you can use the stopwords from nltk's English corpus:
# One-time setup if the corpus is not installed yet: import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop_w = set(stopwords.words('english'))
[' '.join([w for w in s.split() if w.lower() not in stop_w]) for s in pylist]
# ['apple', 'orange', 'pineapple yellow', 'grape red']
Upvotes: 1