Reputation: 73
I am trying to apply sentiment analysis (predicting negative vs. positive tweets) to a relatively large dataset (10,000 rows). So far I have achieved only ~73% accuracy using Naive Bayes with my feature-extraction method "final" shown below. I want to add POS tags to help with the classification, but I am unsure how to implement this. I tried writing a simple function called "pos" (also posted below) and used the tags on my cleaned dataset as features, but that only got around 52% accuracy. Can anyone point me in the right direction for adding POS features to my model? Thank you.
import nltk

def pos(tokens):
    # Note: nltk.pos_tag expects a list of tokens. Passing a single
    # word as a string would tag each character individually.
    return [t for w, t in nltk.pos_tag(tokens)]
def final(text):
    """
    I have code here to remove URLs, hashtags,
    stopwords, usernames, numerals, and punctuation.
    """
    # lemmatization (`clean` is the token list produced by the
    # cleaning code elided above)
    finished = []
    for x in clean:
        finished.append(lem.lemmatize(x))
    return finished
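One common way to feed POS information into a Naive Bayes classifier is to combine each word with its tag, so that e.g. "love/VBP" and "love/NN" become distinct features, rather than using the tags alone (which likely explains the drop to ~52%). Here is a minimal, hypothetical sketch of such a feature extractor; `pos_features` and the hand-tagged example are my own illustration, and `tagged` stands in for the output of `nltk.pos_tag` on your cleaned tokens:

```python
def pos_features(tagged):
    """Build a NaiveBayesClassifier-style feature dict from (word, tag) pairs."""
    features = {}
    for word, tag in tagged:
        # word+tag pairs keep lexical information while disambiguating by POS
        features["contains({}/{})".format(word.lower(), tag)] = True
        # coarse tag-presence features capture grammatical style
        features["has_tag({})".format(tag)] = True
    return features

# Example with hand-tagged tokens (normally produced by nltk.pos_tag):
tagged = [("I", "PRP"), ("love", "VBP"), ("this", "DT"), ("phone", "NN")]
print(pos_features(tagged))
```

Each resulting dict can be paired with its sentiment label and passed to `nltk.NaiveBayesClassifier.train`, just like word-only feature dicts.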
Upvotes: 1
Views: 345
Reputation: 1231
You should first split the tweets into sentences and then tokenize. NLTK provides a method for this.
from nltk.tokenize import sent_tokenize
sents = sent_tokenize(tweet)
Then tokenize each sentence into words and pass the resulting token list to
nltk.pos_tag. That should give accurate POS tags.
Upvotes: 1