Reputation: 807
I have a .xlsx file with some raw text. I'm reading the file into a DataFrame, then trying to remove symbols and stopwords from it. I do have functions for both needs already implemented, but I keep running into the following problems:
If I remove symbols before removing stopwords, contractions lose their apostrophes ("isn't" becomes "isnt", "they're" becomes "theyre"), so they no longer match the stopword list and stay in the DataFrame.
If I remove stopwords before symbols, tokens like "(the" aren't recognized as stopwords and stay in the DataFrame.
Here's what symbol removal looks like:
regex = r'[^\w\s]'
self.dataframe = self.dataframe.replace(regex, '', regex=True)
And stopword removal:
self.dataframe[col] = column.apply(lambda x: ' '.join(
    [item for item in x.split() if item not in stops]))
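To make the two failure modes concrete, here is a minimal reproduction on a plain string. The `stops` set is a tiny made-up stand-in for the real stopword list, and the helper names `remove_symbols` and `remove_stopwords` are only illustrative wrappers around the two snippets above.

```python
import re

# Hypothetical, tiny stopword set standing in for the real list.
stops = {"the", "is", "a", "isn't", "they're"}

def remove_symbols(text):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

def remove_stopwords(text):
    # Keep only whitespace-separated tokens that are not in the stopword set.
    return " ".join(item for item in text.split() if item not in stops)

text = "(the cat isn't here"

# Symbols first: "isn't" turns into "isnt", which no longer matches a stopword.
symbols_first = remove_stopwords(remove_symbols(text))

# Stopwords first: "(the" is not equal to "the", so it survives the filter.
stopwords_first = remove_symbols(remove_stopwords(text))

print(symbols_first)    # cat isnt here
print(stopwords_first)  # the cat here
```

Neither order gives the desired "cat here".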
Is there an elegant solution for this? Any suggestions are also appreciated.
Upvotes: 0
Views: 979
Reputation: 385
First expand the contractions into their full forms: for example, replace "they're" with "they are", "I'd" with "I would", "I'll" with "I will", and "won't" with "will not". Once the text contains only full words, the stopwords can be removed. The example below expands the contractions and then removes the stopwords.
import re
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

sent = "I'll have a bike. They're good. I won't do. I'd be happy"

# Expand contractions so that each part can match a stopword on its own.
sent_replace = re.sub(r"'re", " are", sent)
sent_replace = re.sub(r"'d", " would", sent_replace)
sent_replace = re.sub(r"'ll", " will", sent_replace)
sent_replace = re.sub(r"won't", "will not", sent_replace)
print('Before:', sent)
print('\nAfter:', sent_replace)

# Build the stopword set once; compare in lowercase because the NLTK list is lowercase.
stops = set(stopwords.words('english'))
no_stop_words = ' '.join(item for item in sent_replace.split() if item.lower() not in stops)
print('\nNo stop words:', no_stop_words)
Refer to the snapshot below for the output.
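For the "(the" case from the question, one possible approach is to strip leading and trailing punctuation from each token before comparing it against the stopword list, and only then remove the remaining symbols. The following is a minimal sketch, not a definitive implementation: `clean_text`, `STOPS`, and `CONTRACTIONS` are hypothetical names, the stopword set is an abbreviated stand-in for a real list such as NLTK's, the contraction rules are far from exhaustive, and lowercasing the text is an added assumption.

```python
import re
import string

# Hypothetical, abbreviated stopword set; a real list (e.g. NLTK's) is much longer.
STOPS = {"i", "a", "the", "they", "are", "is", "not",
         "will", "would", "have", "be", "do"}

# Illustrative contraction rules, irregular forms before the generic suffixes.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'d", " would"),
]

def clean_text(text, stops=STOPS):
    # Lowercase so tokens compare equal to the lowercase stopword entries.
    text = text.lower()
    # Step 1: expand contractions while the apostrophes are still present.
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    # Step 2: strip surrounding punctuation per token, so "(the" is checked as "the".
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    kept = [tok for tok in tokens if tok and tok not in stops]
    # Step 3: remove any symbols still left inside the remaining tokens.
    return re.sub(r"[^\w\s]", "", " ".join(kept))

print(clean_text("(The bike isn't bad. They're good, I won't complain."))
# bike bad good complain
```

On a DataFrame column, a function like this could be applied with `column.apply(clean_text)` in place of the two separate steps.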
Upvotes: 1