Tuma

Reputation: 807

Removing stopwords when the sentence contains special characters

I have a .xlsx file with some raw text. I'm reading the file into a DataFrame, then trying to remove symbols and stopwords from it. I do have functions for both needs already implemented, but I keep running into the following problems:

Here's what symbol removal looks like:

regex = r'[^\w\s]'
self.dataframe = self.dataframe.replace(regex, '', regex=True)

And stopword removal:

self.dataframe[col] = column.apply(lambda x: ' '.join(
            [item for item in x.split() if item not in stops]))

Is there an elegant solution for this? Any suggestions are also appreciated.

Upvotes: 0

Views: 979

Answers (1)

Sandeep Panchal

Reputation: 385

First expand contractions into their full forms so the text is easier to process: replace they're with they are, I'd with I would, I'll with I will, won't with will not, and so on. Once the words are expanded, the stopwords can be removed. The example below expands the contractions and then removes the stopwords.

import re
from nltk.corpus import stopwords  # run nltk.download('stopwords') once beforehand

sent = "I'll have a bike. They're good. I won't do. I'd be happy"

# Expand contractions; handle the irregular "won't" before the generic suffixes
sent_replace = re.sub(r"won't", "will not", sent)
sent_replace = re.sub(r"'re", " are", sent_replace)
sent_replace = re.sub(r"'d", " would", sent_replace)
sent_replace = re.sub(r"'ll", " will", sent_replace)

print('Before:', sent)
print('\nAfter:', sent_replace)

# Build the stopword set once instead of reloading the list for every token
stops = set(stopwords.words('english'))
no_stop_words = ' '.join(item for item in sent_replace.split() if item not in stops)
print('\nNo stop words:', no_stop_words)

Refer to the snapshot below for the output.

[Image: screenshot of the script's output]
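To connect this back to the question, here is a minimal sketch that combines the three steps (expand contractions, strip symbols with the question's `[^\w\s]` pattern, drop stopwords) in one function. The names `clean_text`, `CONTRACTIONS`, and `STOPS` are hypothetical, and `STOPS` is a deliberately tiny stand-in for the NLTK list so the snippet runs without any downloads. Lowercasing is an added assumption: the NLTK stopwords are lowercase, so capitalised tokens such as "I" would otherwise slip through.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# in practice this could be set(stopwords.words('english')) from NLTK.
STOPS = {"i", "a", "the", "is", "are", "will", "would",
         "not", "have", "be", "do", "they"}

# Contraction patterns; the irregular "won't" comes before the generic suffixes.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"'re", " are"),
    (r"'d", " would"),
    (r"'ll", " will"),
]


def clean_text(text: str) -> str:
    """Expand contractions, strip symbols, then remove stopwords."""
    text = text.lower()  # assumption: match the lowercase stopword list
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    # Same symbol-removal pattern as in the question, applied after expansion
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(word for word in text.split() if word not in STOPS)


print(clean_text("I'll have a bike. They're good."))  # -> bike good
```

The order matters: if the symbols are stripped first, "they're" becomes "theyre", which no longer matches either the contraction patterns or the stopword list. With a DataFrame, something like `self.dataframe[col].apply(clean_text)` should apply the function to each cell, assuming the column holds strings.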

Upvotes: 1
