Reputation: 245
I want to stem text that I am reading from a CSV file, but after the stemming step the text is unchanged. I then read somewhere that I need POS tags in order to stem, but that didn't help.
Can you please tell me what I am doing wrong? I am reading the CSV, removing punctuation, tokenizing, getting POS tags, and trying to stem, but nothing changes.
import csv
import string
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
import nltk
from nltk import pos_tag
stemmer = nltk.PorterStemmer()
data = pd.read_csv(open('data.csv'),sep=';')
translator=str.maketrans('','',string.punctuation)
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';',
                        quotechar='^', quoting=csv.QUOTE_MINIMAL)
    for line in data['sent']:
        line = line.translate(translator)
        tokens = word_tokenize(line)
        tokens_pos = nltk.pos_tag(tokens)
        final = [stemmer.stem(tagged_word[0]) for tagged_word in tokens_pos]
        writer.writerow(tokens_pos)
Examples of data for stemming:
The question was, what are you going to cut?
Well, again, while you were on the board of the Woods Foundation...
We've got some long-term challenges in this economy.
Thank you in advance for any help!
Upvotes: 0
Views: 1631
Reputation: 2981
To get the desired output, your code should write out the final variable; instead, you are writing out tokens_pos :)
Try the following:
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
def preprocess(sentence):
    stemmer = nltk.PorterStemmer()
    # Strip punctuation (Python 3 form of str.translate), then lowercase
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    cleaned = cleaned.lower()
    tokens = word_tokenize(cleaned)
    # Stem each token; no POS tags are needed for stemming
    final = [stemmer.stem(token) for token in tokens]
    return " ".join(final)
sentence = "We've got some long-term challenges in this economy."
print("Original: " + sentence)
stemmed = preprocess(sentence)
print("Processed: " + stemmed)
Original: We've got some long-term challenges in this economy.
Processed: weve got some longterm challeng in thi economi
Hope it helps you!
Upvotes: 1
Reputation: 50190
You should have tried to debug your code. If (after the necessary imports) you had just tried print(stemmer.stem("challenges")), you would have seen that the stemming does work (the above will print "challeng"). Your problem is a small oversight: you collect the stems in final, but you print tokens_pos. So the "solution" is this:
writer.writerow(final)
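For context, here is a minimal, self-contained sketch of the corrected loop. It makes a few assumptions that are not in the question: the helper name stem_line is made up for illustration, an in-memory io.StringIO buffer stands in for output.csv, a hard-coded list stands in for data['sent'], and plain str.split() replaces word_tokenize so the sketch runs without downloading any NLTK data. The POS-tagging step is dropped because the stemmer never uses the tags.

```python
import csv
import io
import string

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
translator = str.maketrans('', '', string.punctuation)


def stem_line(line):
    """Strip punctuation, split on whitespace, and stem each token."""
    # Assumption: str.split() stands in for word_tokenize here.
    tokens = line.translate(translator).split()
    return [stemmer.stem(token) for token in tokens]


# Assumption: a hard-coded list stands in for data['sent'].
sentences = ["We've got some long-term challenges in this economy."]

buffer = io.StringIO()  # in-memory stand-in for output.csv
writer = csv.writer(buffer, delimiter=';',
                    quotechar='^', quoting=csv.QUOTE_MINIMAL)
for line in sentences:
    # Write the stems, not the tagged tokens.
    writer.writerow(stem_line(line))

print(buffer.getvalue())
```

In the original script the same change amounts to passing final instead of tokens_pos to writer.writerow; everything else can stay as it is.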
Upvotes: 2