Reputation: 115
I am pretty new to the Python environment (Jupyter Notebook), and I am trying to work on a relatively large text dataset. I want to process it by applying the following steps, in this order:
strip whitespace, lowercase, stemming, remove punctuation (but preserve intra-word dashes/hyphens), remove stopwords, remove symbols, strip whitespace again.
I was hoping to use a single function to perform all of these steps instead of doing them individually. Is there a single library and/or function out there that could help? If not, what would be the simplest way to define a function that performs them all in one run?
Upvotes: 4
Views: 6239
Reputation: 788
Alternatively, you can use my pipeline creator class for textual data, which I completed recently. You can find it on GitHub; demo_pipe.py
covers pretty much what you want to do.
Upvotes: 0
Reputation: 8585
As mentioned in a comment, it can be done using a combination of multiple libraries in Python. One function that can perform it all could look like this:
import nltk
import re
import string

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer  # or LancasterStemmer, RegexpStemmer, SnowballStemmer

default_stemmer = PorterStemmer()
default_stopwords = stopwords.words('english')  # or any other list of your choice


def clean_text(text):

    def tokenize_text(text):
        return [w for s in sent_tokenize(text) for w in word_tokenize(s)]

    def remove_special_characters(text, characters=string.punctuation.replace('-', '')):
        tokens = tokenize_text(text)
        pattern = re.compile('[{}]'.format(re.escape(characters)))
        return ' '.join(filter(None, [pattern.sub('', t) for t in tokens]))

    def stem_text(text, stemmer=default_stemmer):
        tokens = tokenize_text(text)
        return ' '.join([stemmer.stem(t) for t in tokens])

    def remove_stopwords(text, stop_words=default_stopwords):
        tokens = [w for w in tokenize_text(text) if w not in stop_words]
        return ' '.join(tokens)

    text = text.strip(' ')                  # strip whitespace
    text = text.lower()                     # lowercase
    text = stem_text(text)                  # stemming
    text = remove_special_characters(text)  # remove punctuation and symbols (keeps intra-word hyphens)
    text = remove_stopwords(text)           # remove stopwords
    # text = text.strip(' ')                # strip whitespace again, if needed
    return text
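Note that word_tokenize, sent_tokenize and the stopword list rely on NLTK data packages; if you have never downloaded them, run this once first (assuming a default NLTK installation):

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize / sent_tokenize
nltk.download('stopwords')  # stopword lists used by stopwords.words('english')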
Testing it with (Python 2.7, but it should work in Python 3 as well):
text = ' Test text !@$%$(%)^ just words and word-word'
clean_text(text)
results in:
u'test text word word-word'
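Since you mention working on a large dataset in a Jupyter notebook: a common pattern is to keep the documents in a pandas DataFrame and apply the function to a whole column. A minimal sketch, assuming a hypothetical DataFrame df with a 'text' column:

import pandas as pd

df = pd.DataFrame({'text': ['  Test text !@$%$(%)^ just words and word-word',
                            'Another document with MORE text...']})
df['clean_text'] = df['text'].apply(clean_text)  # runs clean_text on every row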
Upvotes: 5