Alex

Reputation: 81

NLTK-based stemming and lemmatization

I am trying to preprocess a string with a lemmatizer and then remove the punctuation and digits. I am using the code below. I get no error, but the text is not preprocessed properly: only the stop words are removed, while the lemmatization has no effect and the punctuation and digits remain.

from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)

The final output I get is:

This beautiful day16~ . I ; working exercise45.^^^45 text34 .

And expected output should look like:

This beautiful day I work exercise text

Upvotes: 1

Views: 2259

Answers (3)

Piyush Agrawal

Reputation: 111

To process a tweet properly you can use the following code:

import re
import string
import nltk

def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    bcd = []
    # strip URLs before anything else
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    text1 = text.lower()
    text1 = re.sub(pattern, "", text1)
    text1 = text1.replace("'s ", " ")
    text1 = text1.replace("'", "")
    text1 = text1.replace("—", " ")
    # replace every punctuation character with a space
    table = str.maketrans(string.punctuation, len(string.punctuation) * " ")
    text1 = text1.translate(table)
    geek = nltk.word_tokenize(text1)

    # map Penn Treebank tags to the WordNet POS tags the lemmatizer expects
    abc = nltk.pos_tag(geek)
    output = []
    for value in abc:
        value = list(value)
        if value[1][0] == "N":
            value[1] = 'n'
        elif value[1][0] == "V":
            value[1] = 'v'
        elif value[1][0] == "J":
            value[1] = 'a'
        elif value[1][0] == "R":
            value[1] = 'r'
        else:
            value[1] = 'n'
        output.append(value)
    abc = output
    for value in abc:
        bcd.append(lemmatizer.lemmatize(value[0], pos=value[1]))
    return bcd

Here I have used pos_tag (only N, V, J, R; everything else is mapped to noun). This will return a tokenized and lemmatized list of words.
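For example, running it on the string from the question (a quick sketch; the exact output depends on the tokenizer and tagger models, which you may need to fetch with nltk.download first):

nltk.download('punkt')                        # tokenizer model
nltk.download('averaged_perceptron_tagger')   # POS tagger model
nltk.download('wordnet')                      # lemmatizer data

tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
print(process(tweets))
# roughly: ['this', 'be', 'a', 'beautiful', 'day16', 'i', 'be', 'work', 'on', 'an', 'exercise45', '45', 'text34']

Note that the digits survive, because only punctuation is replaced with spaces; stripping digits would need an extra step.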

Upvotes: 0

cs95

Reputation: 402463

No, your current approach does not work, because you must pass one word at a time to the lemmatizer/stemmer; otherwise, those functions won't know how to interpret your string, since they expect individual words, not whole sentences.

import re
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    # drop punctuation and digits, then lemmatize each remaining non-stop word as a verb
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                     for i in cleaned_tweet.split() if i not in __stop_words])
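For the tweet from the question, this should return roughly the following (the function lowercases everything first, so stop words like "This" and "I" are dropped):

print(clean("This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."))
# 'beautiful day work exercise text'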

Alternatively, you can use a PorterStemmer, which does the same thing as lemmatisation, but without context.

from nltk.stem.porter import PorterStemmer  
stemmer = PorterStemmer() 

And call the stemmer like this:

stemmer.stem(i)
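The difference is easiest to see side by side (a quick illustration; the commented outputs are the typical results):

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('working'))                # 'work'
print(stemmer.stem('beautiful'))              # 'beauti' (stems need not be real words)
print(lemmatizer.lemmatize('working', 'v'))   # 'work'
print(lemmatizer.lemmatize('beautiful'))      # 'beautiful'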

Upvotes: 2

E_K

Reputation: 104

I think this is what you're looking for, but as the commenter noted, do this before calling the lemmatizer.

>>> import re
>>> s = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
>>> s = re.sub(r'[^A-Za-z ]', '', s)
>>> s
'This is a beautiful day I am working on an exercise text'
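From there, lemmatizing word by word in the same session (a minimal sketch; 'v' treats each token as a verb, and WordNet leaves anything it cannot match, such as the capitalized "This", unchanged) should print something close to:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> ' '.join(lemmatizer.lemmatize(w, 'v') for w in s.split())
'This be a beautiful day I be work on an exercise text'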

Upvotes: 1
