Stephan K.

Reputation: 15712

Exclude words that appear in a list from a string

I have the following list:

stopwords = ['a', 'and', 'is']

and the following sentence:

sentence = 'A Mule is Eating and drinking.'

Expected output:

reduced = ['mule', 'eating', 'drinking']

I have so far:

reduced = filter(None, re.match(r'\W+', sentence.lower()))

Now how would you filter out the stopwords? (Note the upper- to lowercase conversion as well as the omission of punctuation.)

Upvotes: 1

Views: 9316

Answers (7)

Anxo P

Reputation: 759

With this code you will remove the stopwords. It works in PySpark as well:

stopwordsT=["a","about","above","above","across","after","afterwards","again","against","all","almost","alone","along","already","also","although","always","am","among", "amongst", "amoungst","amount", "an","and","another","any","anyhow","anyone","anything","anyway","anywhere","are","around","as","at","back","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","below","beside","besides","between","beyond","bill","both","bottom","but","by","call","can","cannot","cant","co","con","could","couldnt","cry","de","describe","detail","do","done","down","due","during","each","eg","eight","either","eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the"]
sentence = "about esto alone es a una already pruba across para after ver too si top funciona"
lst = sentence.split()
' '.join([w for w in lst if w not in stopwordsT])
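
If you really are in PySpark, a minimal sketch using the built-in StopWordsRemover transformer could do the same job on a DataFrame (assuming a SparkSession is available, and reusing stopwordsT and sentence from above):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

# One row whose "words" column holds the tokenized sentence
df = spark.createDataFrame([(sentence.split(),)], ["words"])

remover = StopWordsRemover(inputCol="words", outputCol="filtered",
                           stopWords=stopwordsT)
remover.transform(df).select("filtered").show(truncate=False)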

Upvotes: 0

user3842449

Reputation:

If you are working with textual material, it is worth noting that NLTK (the Natural Language Toolkit) is a framework for analyzing text. Not only does it have many of the built-in functions you would want when working with text, the NLTK Book is a tutorial for learning both Python and text analysis at the same time. How cool is that!

For example,

from nltk.corpus import stopwords
stopwords.words('english')

gives us a list of 127 stop words in the English language. The first few in that list are: i, me, my, myself, and we. Notice that these words are in lower case.

So the problem stated above, processed one particular way with NLTK, would look like:

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

raw = '''All human rules are more or less idiotic, I suppose.
It is best so, no doubt. The way it is now, the asylums can
hold the sane people, but if we tried to shut up the insane
we should run out of building materials. -- Mark Twain'''

tokenizer = RegexpTokenizer(r'\w+')
words  = tokenizer.tokenize(raw)

sw = stopwords.words('english')

reduce = [w.lower() for w in words if w.lower() not in sw]

The line:

tokenizer = RegexpTokenizer(r'\w+')

uses a regular expression that tells the tokenizer to match only runs of word characters, which strips the punctuation. Much of the time it is the stems of words that are important. For example, "human" is the stem of "human's", and analysis centers on the noun "human" rather than its various forms. If we need to keep such detail, the regex can be refined. There is no doubt time to be invested in building robust regular expressions, but practice makes perfect.
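
As a sketch of that stemming idea, NLTK's PorterStemmer (one of several stemmers NLTK ships) can reduce the filtered words built above to their stems:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Map each remaining word to its stem, e.g. "rules" -> "rule"
stems = [stemmer.stem(w) for w in reduce]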

If you don't mind the overhead of learning NLTK, for example because you do routine text analysis, then it might be something to look into.

Upvotes: 1

Padraic Cunningham

Reputation: 180421

You can strip the punctuation:

from string import punctuation
stopwords = set(['a', 'and', 'is'])

sentence = 'A Mule is Eating and drinking.'

# strip the punctuation before the stopword test so e.g. "is." is also filtered
print([word.strip(punctuation) for word in sentence.lower().split()
       if word.strip(punctuation) not in stopwords])
['mule', 'eating', 'drinking']

Using a regex is the wrong approach, as you are going to end up splitting single words like "Foo's" into "foo" and "s". If you are going to use a regex, don't use re.split; use findall and filter instead, so you don't have to filter out empty strings for no reason:

import re

stopwords = set(['a', 'and', 'is'])

reduced = filter(lambda w: w not in stopwords, re.findall(r"\w+", sentence.lower()))
print(list(reduced))  # list() is needed on Python 3, where filter returns an iterator
['mule', 'eating', 'drinking']

To keep "Mules's" as a single word with a regex:

sentence = 'A Mule"s  Eating and drinking.'
reduced = filter(lambda w: w not in stopwords, re.findall(r"\w+\S\w+|\w+", sentence.lower()))
print(list(reduced))
['mule"s', 'eating', 'drinking']

Your own regex and the accepted answer's will split such a word into two parts, which I doubt is what you actually want:

In [7]: sentence = 'A Mule"s Eating and drinking.'
In [8]: reduced = filter(lambda w: w not in stopwords, re.split(r'\W+', sentence.lower()))
In [9]: reduced
Out[9]: ['mule', 's', 'eating', 'drinking', '']
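
If the goal is instead to keep ordinary contractions such as "Foo's" intact while still dropping other punctuation, one possible pattern matches a word plus an optional apostrophe suffix (the sentence here is illustrative):

import re

stopwords = {'a', 'and', 'is'}
sentence = "A Mule's Eating and drinking."

# \w+(?:'\w+)? matches a word optionally followed by 's, 'll, etc.
reduced = [w for w in re.findall(r"\w+(?:'\w+)?", sentence.lower())
           if w not in stopwords]
print(reduced)  # ["mule's", 'eating', 'drinking']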

Upvotes: 0

Puneet

Reputation: 654

stopwords = ['word1', 'word2', 'word3']
sentence = "Word1 Word5 word2 Word4 wORD3"

reduced = sentence.split()

# Iterate over a copy: removing items from the list being looped over skips elements
for i in reduced[:]:
    if i.lower() in stopwords:
        reduced.remove(i)

Upvotes: -1

Anand S Kumar

Reputation: 90899

If you are okay with not going the regex route, you can just use a list comprehension with str.split() and check that each string is not in stopwords.

Example -

>>> stopwords = ['a', 'and', 'is']
>>> sentence = 'a mule is eating and drinking'
>>> reduced = [s.lower() for s in sentence.split() if s.lower() not in stopwords]
>>> reduced
['mule', 'eating', 'drinking']

As a performance benefit, you can also convert the stopwords list to a set using the set() function and do the lookup in that, since searching a set is O(1).
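
For example, a small sketch of the same lookup with a set:

>>> stopwords = set(['a', 'and', 'is'])   # O(1) membership tests
>>> sentence = 'a mule is eating and drinking'
>>> [s for s in sentence.split() if s not in stopwords]
['mule', 'eating', 'drinking']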

Upvotes: 1

fferri

Reputation: 18940

The filter expression is wrong. Change it to:

>>> reduced = filter(lambda w: w and w not in stopwords, re.split(r'\W+', sentence.lower()))

The first argument is the filtering criterion; the w and part also drops the empty string that re.split leaves behind because of the trailing period. Also note that to split the sentence you need re.split, not re.match.

>>> list(reduced)
['mule', 'eating', 'drinking']

Upvotes: 2

Maroun

Reputation: 95968

You don't need a regex to filter the stop words; one way of doing it is to split your string and rebuild it without the strings from the list:

lst = sentence.split()
' '.join([w for w in lst if w not in stopwords])

A regex is useful when you have a pattern that repeats itself, not when you want to match exact occurrences.

Upvotes: 1
