Dela

Reputation: 115

Pipeline for text cleaning / processing in python

I am pretty new to the Python environment (Jupyter Notebook), and I am trying to work with a relatively large text dataset. I want to process it by applying the following steps, in this exact order:

strip whitespace, lowercase, stemming, remove punctuation (but preserve intra-word dashes or hyphens), remove stopwords, remove symbols, strip whitespace again.

I was hoping to find a single function that could perform all of these tasks instead of applying them individually. Is there a single library and/or function out there that could help? If not, what would be the simplest way to define a function that performs all of them in one run?

Upvotes: 4

Views: 6239

Answers (2)

mccandar

Reputation: 788

Alternatively, you can use my pipeline creator class for textual data, which I completed recently. You can find it on GitHub; demo_pipe.py covers pretty much what you want to do.
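As a rough illustration of the idea only (the real API is in the linked repository; the class and method names below are hypothetical, not the ones in demo_pipe.py), such a pipeline class boils down to applying a list of string-to-string callables in order:

class TextPipeline:
    """Apply a sequence of text-cleaning callables in order (illustrative sketch)."""

    def __init__(self, steps):
        self.steps = steps  # list of functions, each taking and returning a string

    def transform(self, text):
        for step in self.steps:
            text = step(text)
        return text

# Example usage with two trivial steps:
pipe = TextPipeline([str.strip, str.lower])
print(pipe.transform('  Some RAW text  '))  # -> 'some raw text'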

Upvotes: 0

Vlad

Reputation: 8585

As mentioned in a comment, this can be done by combining several Python libraries. A single function that performs all the steps could look like this:

import nltk
import re
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer # or LancasterStemmer, RegexpStemmer, SnowballStemmer

default_stemmer = PorterStemmer()
default_stopwords = stopwords.words('english') # or any other list of your choice
def clean_text(text):

    def tokenize_text(text):
        return [w for s in sent_tokenize(text) for w in word_tokenize(s)]

    def remove_special_characters(text, characters=string.punctuation.replace('-', '')):
        tokens = tokenize_text(text)
        pattern = re.compile('[{}]'.format(re.escape(characters)))
        return ' '.join(filter(None, [pattern.sub('', t) for t in tokens]))

    def stem_text(text, stemmer=default_stemmer):
        tokens = tokenize_text(text)
        return ' '.join([stemmer.stem(t) for t in tokens])

    def remove_stopwords(text, stop_words=default_stopwords):
        tokens = [w for w in tokenize_text(text) if w not in stop_words]
        return ' '.join(tokens)

    text = text.strip(' ') # strip whitespaces
    text = text.lower() # lowercase
    text = stem_text(text) # stemming
    text = remove_special_characters(text) # remove punctuation and symbols
    text = remove_stopwords(text) # remove stopwords
    text = text.strip(' ') # strip whitespaces again (the final step you asked for)

    return text
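
Note that the NLTK tokenizers and stopword list rely on data packages that must be downloaded once; otherwise sent_tokenize and stopwords.words will raise a LookupError:

import nltk
nltk.download('punkt')      # models for sent_tokenize / word_tokenize
nltk.download('stopwords')  # word lists for stopwords.words('english')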

Testing it with (Python 2.7, but it should work in Python 3 as well; there the output is the same string without the u prefix):

text = '  Test text !@$%$(%)^   just words and word-word'
clean_text(text)

results in:

u'test text word word-word'
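
Note how the intra-word hyphen survives: remove_special_characters builds its character class from string.punctuation with '-' stripped out, so dashes inside words are preserved. The same pattern in isolation:

import re
import string

characters = string.punctuation.replace('-', '')
pattern = re.compile('[{}]'.format(re.escape(characters)))
print(pattern.sub('', 'word-word !@$%'))  # -> 'word-word ' (hyphen kept, symbols removed)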

Upvotes: 5
