Bluetail

Reputation: 1291

Removing a custom list of stopwords for an NLP task

I have written a function to clean my text corpus, which is of the following form:

["wild things is a suspenseful .. twists .  ",
 "i know it already.. film goers .  ",
.....,
"touchstone pictures..about it .  okay ?  "]

It is a list of sentences (strings), separated by commas.

My function is:

def clean_sentences(sentences):  
   
    sentences = (re.sub(r'\d+','£', s) for s in sentences)
 
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is' , 'it']
       
    sentences = ' '.join(w for w in sentences if w not in stopwords)

    return sentences 

It replaces the numbers with '£' but it does not remove the stopwords.

Output:

'wild things is a suspenseful thriller...

and a £ . £ rating , it\'s still watchable , just don\'t think about it .  okay ?  '

I don't understand why. Thank you.

Upvotes: 0

Views: 84

Answers (2)

Alan Shiah

Reputation: 1086

I believe it's because you used a regex to replace digits with the symbol £ in your code. For clarification: sentences = (re.sub(r'\d+','£', s) for s in sentences)

This is a piece of code that replaces any digits with that symbol. I see that you define your list of stopwords, and then make a new list without those stopwords. However, the symbol £ you replaced your numbers with is not in the list of stopwords, therefore it won't be excluded in your new list. You could try adding that to your list of stopwords like so:

def clean_sentences(sentences):  
   
    sentences = (re.sub(r'\d+','£', s) for s in sentences)
 
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is' , 'it', '£']
       
    sentences = ' '.join(w for w in sentences if w not in stopwords)

    return sentences 

Hope this helps!

EDIT: I also believe there is a problem with your original code. It seems that you are trying to use sentences = ' '.join(w for w in sentences if w not in stopwords) to join your sentences together and take out any stopwords. However, that is not how the not in operator behaves here. Because you iterate over sentences, w is a whole sentence, and not in only checks whether that entire sentence is an element of the stopword list; it does not look for stopwords inside the sentence. As a result, nothing is removed. What you want to do is split each sentence into words first, filter those words with not in, and then join the remaining words back together with the same .join method you already use. That way the not in operator checks each word and drops it if it is a stopword.
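
To illustrate the difference, here is a minimal sketch. The sample sentences, the shortened stopword list, and the variable names unfiltered and filtered are made up for illustration and are not taken from your corpus. It compares filtering whole sentences against filtering the words of each sentence:

```python
# Hypothetical sample data, shortened for illustration.
stopwords = ['a', 'is', 'it']
sentences = ["wild things is a suspenseful thriller", "i know it already"]

# Iterating over the list yields whole sentences, and no whole sentence
# is an element of the stopword list, so nothing is filtered out.
unfiltered = [s for s in sentences if s not in stopwords]

# Splitting each sentence first lets `not in` test one word at a time.
filtered = [' '.join(w for w in s.split() if w not in stopwords)
            for s in sentences]

print(unfiltered)
print(filtered)
```

The first list comes out identical to the input, while the second has the stopwords removed from each sentence.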

Upvotes: 1

EliasK93

Reputation: 3174

You compare the whole sentences to the stopwords when you actually want to compare words within the sentences to the stopwords.

import re

sentences = ["wild things is a suspenseful .. twists .  ",
             "i know it already.. film goers .  ",
             "touchstone pictures..about it .  okay ?  "]

stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is', 'it']

As a loop:

def clean_sentences(sentences):
    new_sentences = []
    for sentence in sentences:
        new_sentence = sentence.split()
        new_sentence = [re.sub(r'\d+', '£', word) for word in new_sentence]
        new_sentence = [word for word in new_sentence if word not in stopwords]
        new_sentence = " ".join(new_sentence)
        new_sentences.append(new_sentence)
    return new_sentences

Or, much more compact, as a list comprehension:

def clean_sentences(sentences):
    return [" ".join([re.sub(r'\d+', '£', word) for word in sentence.split() if word not in stopwords]) for sentence in sentences]

Both return:

print(clean_sentences(sentences))
> ['wild things suspenseful .. twists .', 'i know already.. film goers .', 'touchstone pictures..about . okay ?']

Upvotes: 1
