Reputation: 1291
I have written a function to clean my text corpus, which is of the following form:
["wild things is a suspenseful .. twists . ",
"i know it already.. film goers . ",
.....,
"touchstone pictures..about it . okay ? "]
This is a list of sentences separated by commas.
My function is:
import re

def clean_sentences(sentences):
    sentences = (re.sub(r'\d+','£', s) for s in sentences)
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is' , 'it']
    sentences = ' '.join(w for w in sentences if w not in stopwords)
    return sentences
It replaces the numbers with '£' but it does not remove the stopwords.
Output:
'wild things is a suspenseful thriller...
and a £ . £ rating , it\'s still watchable , just don\'t think about it . okay ? '
I don't understand why. Thank you.
Upvotes: 0
Views: 84
Reputation: 1086
I believe it's because you used a regex to replace digits with the £ symbol. To clarify, the line sentences = (re.sub(r'\d+','£', s) for s in sentences) replaces every run of digits with that symbol. You then define your list of stopwords and build a new string without them. However, the £ symbol that replaced your numbers is not in the list of stopwords, so it won't be excluded. You could try adding it to the list, like so:
def clean_sentences(sentences):
    sentences = (re.sub(r'\d+','£', s) for s in sentences)
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is' , 'it', '£']
    sentences = ' '.join(w for w in sentences if w not in stopwords)
    return sentences
Hope this helps!
EDIT:
I also believe there may be a problem with your original code. It seems you are using sentences = ' '.join(w for w in sentences if w not in stopwords) to join your sentences together and take out any stopwords. However, that is not how the not in operator works here: w is a whole sentence, and not in only checks whether that whole string is an element of the list. It does not look for stopwords inside the sentence, so nothing is removed. Instead, split each sentence into words first, then build the new string with the same .join approach you already use. That way the not in operator checks each word and drops it if it is a stopword.
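To make the difference concrete, here is a minimal sketch comparing the two checks. The shortened stopword list and the sample sentence are just for illustration, not taken from your full corpus.

```python
stopwords = ['a', 'and', 'is', 'it']  # shortened list, for illustration only
sentence = "i know it already"

# Whole-sentence check: the sentence is never equal to a single stopword,
# so a filter based on this test keeps the sentence unchanged.
whole_kept = sentence not in stopwords

# Word-by-word check after splitting on whitespace.
words_kept = [w for w in sentence.split() if w not in stopwords]
cleaned = " ".join(words_kept)

print(whole_kept)  # True
print(cleaned)     # i know already
```

The first check is what the original generator does for every sentence, which is why the output still contains all the stopwords.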
Upvotes: 1
Reputation: 3174
You compare the whole sentences to the stopwords when you actually want to compare words within the sentences to the stopwords.
import re
sentences = ["wild things is a suspenseful .. twists . ",
             "i know it already.. film goers . ",
             "touchstone pictures..about it . okay ? "]
stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is', 'it']
As a loop:
def clean_sentences(sentences):
    new_sentences = []
    for sentence in sentences:
        new_sentence = sentence.split()
        new_sentence = [re.sub(r'\d+', '£', word) for word in new_sentence]
        new_sentence = [word for word in new_sentence if word not in stopwords]
        new_sentence = " ".join(new_sentence)
        new_sentences.append(new_sentence)
    return new_sentences
Or, much more compact, as a list comprehension:
def clean_sentences(sentences):
    return [" ".join([re.sub(r'\d+', '£', word) for word in sentence.split() if word not in stopwords]) for sentence in sentences]
Both return:
print(clean_sentences(sentences))
> ['wild things suspenseful .. twists .', 'i know already.. film goers .', 'touchstone pictures..about . okay ?']
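The sample sentences above contain no digits, so they don't exercise the substitution step. As a quick check, here is a minimal sketch with a made-up sentence that does contain numbers, modelled on the "£ . £ rating" fragment in the question. The stopwords parameter and the use of a set instead of a list are my own assumptions for illustration, not something the question requires.

```python
import re

# Same stopwords as above, stored as a set (an assumption, for faster lookups).
STOPWORDS = {'a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself',
             'him', 'himself', 'is', 'it'}

def clean_sentences(sentences, stopwords=STOPWORDS):
    cleaned = []
    for sentence in sentences:
        # Split into words, then replace each run of digits with '£'.
        words = [re.sub(r'\d+', '£', w) for w in sentence.split()]
        # Drop stopwords word by word and rebuild the sentence.
        cleaned.append(" ".join(w for w in words if w not in stopwords))
    return cleaned

# Made-up test sentence containing digits and stopwords.
sample = ["and a 7 . 5 rating , it is still watchable"]
result = clean_sentences(sample)
print(result)  # ['£ . £ rating , still watchable']
```

Note that splitting on whitespace only removes stopwords that stand alone as tokens. A word with punctuation attached, such as "it's", is left as it is.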
Upvotes: 1