ZverArt

Reputation: 245

Stemming in Python

I want to stem my text, which I am reading from a CSV file. But after the stemming step the text is unchanged. Then I read somewhere that I need to use POS tags in order to stem, but that didn't help.

Can you please tell me what I am doing wrong? I am reading the CSV, removing punctuation, tokenizing, getting POS tags, and then stemming, but nothing changes.

import csv
import string

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

stemmer = nltk.PorterStemmer()
data = pd.read_csv('data.csv', sep=';')

translator=str.maketrans('','',string.punctuation)

with open('output.csv', 'w', newline='') as csvfile:
   writer = csv.writer(csvfile, delimiter=';',
                            quotechar='^', quoting=csv.QUOTE_MINIMAL)

   for line in data['sent']:
        line = line.translate(translator)
        tokens = word_tokenize(line)
        tokens_pos = nltk.pos_tag(tokens)
        final = [stemmer.stem(tagged_word[0]) for tagged_word in tokens_pos]
        writer.writerow(tokens_pos)

Examples of data for stemming:

The question was, what are you going to cut?
Well, again, while you were on the board of the Woods Foundation...
We've got some long-term challenges in this economy.

Thank you in advance for any help!

Upvotes: 0

Views: 1631

Answers (2)

Kuntal-G

Reputation: 2981

Your code should write the final variable to get the desired output; instead, you are writing tokens_pos :)

Try the following:

import string
import nltk
from nltk.tokenize import word_tokenize


def preprocess(sentence):
    stemmer = nltk.PorterStemmer()
    # Strip punctuation (Python 3 form of str.translate), then lowercase.
    cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
    cleaned = cleaned.lower()
    tokens = word_tokenize(cleaned)
    # Stem each token and join the stems back into one string.
    final = [stemmer.stem(token) for token in tokens]
    return " ".join(final)

sentence = "We've got some long-term challenges in this economy."
print("Original: " + sentence)

stemmed = preprocess(sentence)
print("Processed: " + stemmed)

Output:

Original: We've got some long-term challenges in this economy.
Processed: weve got some longterm challeng in thi economi

Hope it helps you!

Upvotes: 1

alexis

Reputation: 50190

You should have tried to debug your code. If (after the necessary imports) you had just tried print(stemmer.stem("challenges")), you would have seen that the stemming does work (the above will print "challeng"). Your problem is a small oversight: you collect the stems in final, but you write tokens_pos to the file. So the "solution" is this:

writer.writerow(final)
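
For context, here is a rough sketch of how the corrected loop might look once the stems are the thing being written. It is only an illustration under a few assumptions: write_stems and strip_plural are hypothetical names, whitespace splitting stands in for word_tokenize, and the stemmer is passed in as a callable so the sketch runs without NLTK.

```python
import csv
import io
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def write_stems(sentences, stem, out):
    """Write one CSV row of stemmed tokens per sentence (hypothetical helper)."""
    writer = csv.writer(out, delimiter=';', quotechar='^',
                        quoting=csv.QUOTE_MINIMAL)
    for line in sentences:
        # Assumption: whitespace splitting stands in for word_tokenize.
        tokens = line.translate(PUNCT_TABLE).split()
        final = [stem(token) for token in tokens]
        # Write the stems, not the untouched (token, tag) pairs.
        writer.writerow(final)


def strip_plural(word):
    """Stand-in stemmer for illustration only; not the Porter algorithm."""
    return word[:-1] if word.endswith('s') else word


buffer = io.StringIO()
write_stems(["We've got some long-term challenges"], strip_plural, buffer)
print(buffer.getvalue())
```

With the question's setup you would presumably pass stemmer.stem as the stem argument and the opened output.csv file as out.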

Upvotes: 2
