Just to preface, this code is from a great guy on GitHub/YouTube: https://github.com/the-javapocalypse/
I made some minor tweaks for my personal use.
One thing that has always stood between me and sentiment analysis on Twitter is the sheer number of bot posts. I figure that if I cannot avoid the bots altogether, maybe I can remove the duplicates to hedge their impact.
For example, take "#bitcoin" or "#btc": bot accounts under many different handles post the exact same tweet, something like "It's going to the moon! Buy now #btc or forever regret it! Buy, buy, buy! Here's a link to my personal site [insert personal site url here]"
This reads as a positive-sentiment post. If 25 accounts each post it twice, that is 50 inflated data points when I am only analyzing the most recent 500 tweets containing "#btc".
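To put rough numbers on that inflation (a toy illustration, figures made up):

total = 500
bot_copies = 50                  # 25 accounts x 2 posts of the same tweet
positive = 200 + bot_copies      # every copy scores as positive
print(positive / total)          # 0.5  -> reads as 50% positive
# after dedup, only one copy of the bot tweet survives
print((200 + 1) / (total - bot_copies + 1))   # ~0.446 -> about 44.6% positive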
So to my question: how can I remove these duplicate tweets before running the sentiment analysis?
See below for the entire script:
import tweepy
import csv
import re
from textblob import TextBlob
import matplotlib.pyplot as plt
class SentimentAnalysis:

    def __init__(self):
        self.tweets = []
        self.tweetText = []

    def DownloadData(self):
        # authenticating
        consumerKey = ''
        consumerSecret = ''
        accessToken = ''
        accessTokenSecret = ''
        auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
        auth.set_access_token(accessToken, accessTokenSecret)
        api = tweepy.API(auth)

        # input for term to be searched and how many tweets to search
        searchTerm = input("Enter Keyword/Tag to search about: ")
        NoOfTerms = int(input("Enter how many tweets to search: "))

        # searching for tweets
        self.tweets = tweepy.Cursor(api.search, q=searchTerm, lang="en").items(NoOfTerms)

        csvFile = open('result.csv', 'a')
        csvWriter = csv.writer(csvFile)

        # creating some variables to store info
        polarity = 0
        positive = 0
        negative = 0
        neutral = 0

        # iterating through tweets fetched
        for tweet in self.tweets:
            # append to temp so that we can store in csv later; I encode UTF-8
            self.tweetText.append(self.cleanTweet(tweet.text).encode('utf-8'))
            analysis = TextBlob(tweet.text)
            # print(analysis.sentiment)  # print tweet's polarity
            polarity += analysis.sentiment.polarity  # adding up polarities

            if analysis.sentiment.polarity == 0:  # adding reaction
                neutral += 1
            elif analysis.sentiment.polarity > 0.0:
                positive += 1
            else:
                negative += 1

        # writes all cleaned tweets as a single csv row
        csvWriter.writerow(self.tweetText)
        csvFile.close()

        # finding average of how people are reacting
        positive = self.percentage(positive, NoOfTerms)
        negative = self.percentage(negative, NoOfTerms)
        neutral = self.percentage(neutral, NoOfTerms)

        # finding average reaction
        polarity = polarity / NoOfTerms

        # printing out data
        print("How people are reacting on " + searchTerm +
              " by analyzing " + str(NoOfTerms) + " tweets.")
        print()
        print("General Report: ")

        if polarity == 0:
            print("Neutral")
        elif polarity > 0.0:
            print("Positive")
        else:
            print("Negative")

        print()
        print("Detailed Report: ")
        print(str(positive) + "% positive")
        print(str(negative) + "% negative")
        print(str(neutral) + "% neutral")

        self.plotPieChart(positive, negative, neutral, searchTerm, NoOfTerms)

    def cleanTweet(self, tweet):
        # remove links, mentions and special characters from the tweet
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    # function to calculate percentage
    def percentage(self, part, whole):
        temp = 100 * float(part) / float(whole)
        return format(temp, '.2f')

    def plotPieChart(self, positive, negative, neutral, searchTerm, noOfSearchTerms):
        labels = ['Positive [' + str(positive) + '%]', 'Neutral [' + str(neutral) + '%]',
                  'Negative [' + str(negative) + '%]']
        # percentage() returns strings, so convert back to floats for plt.pie
        sizes = [float(positive), float(neutral), float(negative)]
        colors = ['yellowgreen', 'gold', 'red']
        patches, texts = plt.pie(sizes, colors=colors, startangle=90)
        plt.legend(patches, labels, loc="best")
        plt.title('How people are reacting on ' + searchTerm +
                  ' by analyzing ' + str(noOfSearchTerms) + ' Tweets.')
        plt.axis('equal')
        plt.tight_layout()
        plt.show()


if __name__ == "__main__":
    sa = SentimentAnalysis()
    sa.DownloadData()
You can remove the duplicates with this one-liner (applied to your list of tweet texts):
self.tweets = list(set(self.tweets))
This will remove every duplicated tweet. In case you want to see it working, here is a simple example:
>>> tweets = ['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> print(tweets)
['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> tweets = list(set(tweets))
>>> print(tweets)
['this is a tweet', 'Yet another Tweet']
Now that the duplicates are removed, you can get the number of tweets still needed by subtracting the length of self.tweets from NoOfTerms:
tweets_to_further_scrape = NoOfTerms - len(self.tweets)
Now you can scrape tweets_to_further_scrape more tweets and repeat this process of scraping and removing duplicates until you have the desired number of unique tweets; a sketch of that loop follows below.
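Here is a minimal sketch of that loop (assuming tweepy 3.x, where api.search exists; fetch_texts is a hypothetical helper, and max_rounds caps the retries since repeated searches may keep returning the same tweets):

import tweepy

# hypothetical helper: pull up to n tweet texts for a search term
def fetch_texts(api, search_term, n):
    return [status.text for status in
            tweepy.Cursor(api.search, q=search_term, lang="en").items(n)]

def collect_unique_tweets(api, search_term, target, max_rounds=10):
    unique = set()
    for _ in range(max_rounds):
        if len(unique) >= target:
            break
        # top up with however many unique tweets we are still missing
        unique.update(fetch_texts(api, search_term, target - len(unique)))
    return list(unique)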
When iterating over the tweets, add this line to remove external links:
tweet.text = ' '.join([i for i in tweet.text.split() if 'http' not in i])
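For example, on a made-up tweet:

>>> text = "Buy now #btc or forever regret it! https://t.co/abc123"
>>> ' '.join([i for i in text.split() if 'http' not in i])
'Buy now #btc or forever regret it!'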
Hope this will help you out. Happy coding!
You could simply keep a running count of tweet instances using a defaultdict. You may want to strip out web addresses as well, in case the bots are blasting out new shortened URLs each time.
from collections import defaultdict

def __init__(self):
    ...
    self.tweet_count = defaultdict(int)

def track_tweet(self, tweet):
    t = self.clean_tweet(tweet)
    self.tweet_count[t] += 1

def clean_tweet(self, tweet):
    t = tweet.lower()
    # any other tweet normalization happens here, such as dropping URLs
    return t

def DownloadData(self):
    ...
    for tweet in self.tweets:
        self.track_tweet(tweet.text)
        ...
        # add logic to check for number of repeats in the dictionary
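For instance, inside DownloadData's loop you might only score the first occurrence of each cleaned tweet (a sketch; skipping repeats entirely is just one possible policy):

for tweet in self.tweets:
    t = self.clean_tweet(tweet.text)
    self.tweet_count[t] += 1
    if self.tweet_count[t] > 1:
        continue    # repeated bot tweet: score it only once
    analysis = TextBlob(tweet.text)
    polarity += analysis.sentiment.polarity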