T D

Reputation: 509

Twitter Sentiment Analysis - Remove bot duplication for more accurate results

Just to preface: this code is from a great guy on GitHub/YouTube: https://github.com/the-javapocalypse/

I made some minor tweaks for my personal use.

One thing that has always stood between me and sentiment analysis on Twitter is the sheer number of bot posts. I figure that if I cannot avoid the bots altogether, maybe I can at least remove the duplication to hedge their impact.

For example, with "#bitcoin" or "#btc", bot accounts under many different handles post the exact same tweet. It might say: "It's going to the moon! Buy now #btc or forever regret it! Buy, buy, buy! Here's a link to my personal site [insert personal site url here]"

This would read as a positive-sentiment post. If 25 accounts each post it twice, that is noticeable inflation when I am only analyzing the most recent 500 tweets containing "#btc".

So to my question:

  1. What is an effective way to remove duplicates before writing to the CSV file? I was thinking of adding a simple if statement that checks an array to see whether a tweet already exists. There is an issue with this, though: say I request 1000 tweets to analyze. If 500 of those are bot duplicates, my 1000-tweet analysis just became a 501-tweet analysis. This leads to my next question.
  2. What is a way to check for duplicates and, for each one found, add 1 to the total number of tweets requested? Example: I want to analyze 1000 tweets. One duplicate was found, so there are 999 unique tweets in the analysis. I want the script to fetch one more to make it 1000 unique tweets (1001 tweets including the duplicate).
  3. A small change, but I think it would be useful to know how to drop all tweets with embedded hyperlinks. This plays into the objective of question 2, since the script would have to compensate for the dropped hyperlink tweets. Example: I want to analyze 1000 tweets, and 500 of them have embedded URLs, so those 500 are removed and I am down to 500. I still want 1000, so the script needs to keep fetching non-URL, non-duplicate tweets until 1000 unique, URL-free tweets have been collected (a rough sketch of what I had in mind is below the list).
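
This is the kind of loop I was imagining. It is only a sketch; fetch_batch is a made-up placeholder for whatever actually pulls the next batch of tweet texts from the API:

def collect_unique_tweets(fetch_batch, target=1000):
    # Keep fetching until we have `target` unique, URL-free tweet texts.
    # `fetch_batch` is a hypothetical callable returning a list of texts.
    seen = set()
    unique = []
    while len(unique) < target:
        batch = fetch_batch()
        if not batch:  # API ran dry; return what we have
            break
        for text in batch:
            if 'http' in text or text in seen:  # skip links and duplicates
                continue
            seen.add(text)
            unique.append(text)
            if len(unique) == target:
                break
    return unique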

See below for the entire script:

import tweepy
import csv
import re
from textblob import TextBlob
import matplotlib.pyplot as plt


class SentimentAnalysis:

    def __init__(self):
        self.tweets = []
        self.tweetText = []

    def DownloadData(self):
        # authenticating
        consumerKey = ''
        consumerSecret = ''
        accessToken = ''
        accessTokenSecret = ''
        auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
        auth.set_access_token(accessToken, accessTokenSecret)
        api = tweepy.API(auth)

        # input for term to be searched and how many tweets to search
        searchTerm = input("Enter Keyword/Tag to search about: ")
        NoOfTerms = int(input("Enter how many tweets to search: "))

        # searching for tweets
        self.tweets = tweepy.Cursor(api.search, q=searchTerm, lang="en").items(NoOfTerms)

        csvFile = open('result.csv', 'a', newline='')  # newline='' prevents blank rows from the csv module on Windows

        csvWriter = csv.writer(csvFile)

        # creating some variables to store info
        polarity = 0
        positive = 0
        negative = 0
        neutral = 0

        # iterating through tweets fetched
        for tweet in self.tweets:
            # append the cleaned, UTF-8 encoded text so we can store it in the csv later
            self.tweetText.append(self.cleanTweet(tweet.text).encode('utf-8'))
            analysis = TextBlob(tweet.text)
            # print(analysis.sentiment)  # print tweet's polarity
            polarity += analysis.sentiment.polarity  # adding up polarities

            if (analysis.sentiment.polarity == 0):  # adding reaction
                neutral += 1
            elif (analysis.sentiment.polarity > 0.0):
                positive += 1
            else:
                negative += 1

        csvWriter.writerow(self.tweetText)
        csvFile.close()

        # finding average of how people are reacting
        positive = self.percentage(positive, NoOfTerms)
        negative = self.percentage(negative, NoOfTerms)
        neutral = self.percentage(neutral, NoOfTerms)

        # finding average reaction
        polarity = polarity / NoOfTerms

        # printing out data
        print("How people are reacting on " + searchTerm +
              " by analyzing " + str(NoOfTerms) + " tweets.")
        print()
        print("General Report: ")

        if (polarity == 0):
            print("Neutral")
        elif (polarity > 0.0):
            print("Positive")
        else:
            print("Negative")

        print()
        print("Detailed Report: ")
        print(str(positive) + "% positive")
        print(str(negative) + "% negative")
        print(str(neutral) + "% neutral")

        self.plotPieChart(positive, negative, neutral, searchTerm, NoOfTerms)

    def cleanTweet(self, tweet):
        # Remove Links, Special Characters etc from tweet
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", tweet).split())

    # function to calculate percentage
    def percentage(self, part, whole):
        temp = 100 * float(part) / float(whole)
        return format(temp, '.2f')

    def plotPieChart(self, positive, negative, neutral, searchTerm, noOfSearchTerms):
        labels = ['Positive [' + str(positive) + '%]', 'Neutral [' + str(neutral) + '%]',
                  'Negative [' + str(negative) + '%]']
        sizes = [positive, neutral, negative]
        colors = ['yellowgreen', 'gold', 'red']
        patches, texts = plt.pie(sizes, colors=colors, startangle=90)
        plt.legend(patches, labels, loc="best")
        plt.title('How people are reacting on ' + searchTerm +
                  ' by analyzing ' + str(noOfSearchTerms) + ' Tweets.')
        plt.axis('equal')
        plt.tight_layout()
        plt.show()


if __name__ == "__main__":
    sa = SentimentAnalysis()
    sa.DownloadData()

Upvotes: 0

Views: 1418

Answers (2)

Javapocalypse

Reputation: 2363

Answer to your first question

You can remove duplicates with this one-liner:

self.tweets = list(set(self.tweets))

This will remove every duplicated tweet. Note that set() dedupes by exact equality of hashable items, so apply it to the tweet texts rather than the tweepy Status objects. In case you want to see it working, here is a simple example:

>>> tweets = ['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> print(tweets)
['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> tweets = list(set(tweets))
>>> print(tweets)
['this is a tweet', 'Yet another Tweet']

Answer to your second question

Since you have now removed the duplicates, you can get the number of tweets still needed by taking the difference between NoOfTerms and the length of self.tweets:

tweets_to_further_scrape = NoOfTerms - len(self.tweets)

Now you can scrape tweets_to_further_scrape more tweets and repeat this process of scraping and deduplicating until you have the desired number of unique tweets.
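
A rough sketch of that loop, assuming a hypothetical scrape(n) helper that returns a list of n tweet texts:

def scrape_unique(scrape, NoOfTerms):
    # Repeat scrape-and-dedupe until we have NoOfTerms unique texts.
    # `scrape(n)` is a hypothetical helper returning a list of n tweet texts.
    tweets = list(set(scrape(NoOfTerms)))
    while len(tweets) < NoOfTerms:
        shortfall = NoOfTerms - len(tweets)
        new_batch = scrape(shortfall)
        if not new_batch:  # nothing left to fetch; avoid looping forever
            break
        tweets = list(set(tweets + new_batch))
    return tweets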

Answer to your third question

When iterating over the tweets list, add this line to strip external links from the tweet text:

tweet.text = ' '.join([i for i in tweet.text.split() if 'http' not in i])
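
Note this strips the links but keeps the tweet. If you would rather drop the whole tweet, as your question 3 describes, a small check like this should work:

def has_link(text):
    # True if any token in the tweet text contains a URL fragment
    return any('http' in token for token in text.split())

tweets = ['no links here', 'buy now https://example.com/shill']
print([t for t in tweets if not has_link(t)])  # ['no links here']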

Hope this will help you out. Happy coding!

Upvotes: 2

James

Reputation: 36691

You could simply keep a running count of tweet instances using a defaultdict. You may also want to strip out web addresses, in case the bots are blasting out fresh shortened URLs each time.

from collections import defaultdict

def __init__(self):
    ...
    self.tweet_count = defaultdict(int)

def track_tweet(self, tweet):
    t = self.clean_tweet(tweet)
    self.tweet_count[t] += 1

def clean_tweet(self, tweet):
    t = tweet.lower()
    # any other tweet normalization happens here, such as dropping URLs
    return t

def DownloadData(self):
    ...
    for tweet in self.tweets:
        ...
        # add logic to check for number of repeats in the dictionary.
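
A minimal sketch of that check, on the assumption that a tweet should only contribute to the sentiment totals the first time its normalized text is seen:

from collections import defaultdict

tweet_count = defaultdict(int)

def is_first_occurrence(text):
    # count every sighting, but only report True for the first one
    tweet_count[text.lower()] += 1
    return tweet_count[text.lower()] == 1

for text in ['Buy #btc now!', 'buy #btc now!', 'Sell everything']:
    if is_first_occurrence(text):
        print('analyzing:', text)  # the duplicate is skipped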

Upvotes: 0
