Reputation: 189
I would like to build a data set of tweets on a certain keyword, using the Twitter Streaming API and Tweepy module for Python.
So far so good, but does anyone know how to receive tweets that are exactly the same (mostly retweets) just once? For my data analysis it isn't really useful to receive the same tweet multiple times.
Is there maybe a filter that removes tweets that were already downloaded to the data set?
Upvotes: 1
Views: 1838
Reputation: 5308
There are 2 cases here:
1) The tweets match exactly.
2) The tweets are nearly the same.
For both cases, this is what I do (you can pick your own similarity_threshold):
from difflib import SequenceMatcher

similarity_threshold = 0.7

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

latest_tweets = []  # texts of tweets seen so far

def on_status(self, data):
    # Find a stored tweet that is similar enough to count as a duplicate
    duplicate_tweet = next((t for t in latest_tweets if similarity(data.text, t) > similarity_threshold), None)
    if duplicate_tweet is None:
        # This is a new tweet, so remember its text
        latest_tweets.append(data.text)
    return True
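For context, here is a minimal sketch of how this check could sit inside a Tweepy stream listener (assuming Tweepy 3.x, where tweepy.StreamListener and on_status exist; the class name and the deque cap are my own choices, and capping the history keeps the linear scan bounded):

import tweepy
from collections import deque
from difflib import SequenceMatcher

similarity_threshold = 0.7

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

class DedupListener(tweepy.StreamListener):  # hypothetical class name
    def __init__(self, api=None):
        super().__init__(api)
        # Keep only the most recent 1000 texts so each comparison stays cheap
        self.latest_tweets = deque(maxlen=1000)

    def on_status(self, status):
        duplicate = next((t for t in self.latest_tweets
                          if similarity(status.text, t) > similarity_threshold), None)
        if duplicate is None:
            self.latest_tweets.append(status.text)
            print(status.text)  # process the new tweet here
        return True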
Upvotes: 1
Reputation: 11100
If you are finding that the runtime is unacceptable for data of a given size, then it is time to do something better. Some ad-hoc hashing might be to iterate through the batch that you fetch and store it in a dictionary of sets, where the keys are the count of each letter divided by some bucket size. This partitions your tweets into more reasonably sized sets and reduces the linear-time comparisons by a constant factor that depends on your bucket size.
Defining your hash vector determines the behaviour of the resulting data object. For example, if you only use alphabet characters, then clones with extra quotation marks and emojis will likely land in the same bucket, given a large enough bucket size. On the other hand, if you hashed on the number of different digits in the tweet, you probably wouldn't see much effect.
import string

setOfTweets = ['this is a tweet #Twitter', 'this is another tweet.']
alphabetLetters = string.ascii_lowercase  # 'a' through 'z'

MyHash = {}  # not actually a pythonic hash: a dict of sets keyed by bucketed letter counts
for tweet in setOfTweets:
    counts = {letter: 0 for letter in alphabetLetters}
    for ch in tweet:
        try:
            counts[ch.lower()] += 1
        except KeyError:
            pass  # ignore characters outside a-z
    # Integer-divide each count by the bucket size (3) so near-duplicates share a key
    key = tuple(counts[letter] // 3 for letter in alphabetLetters)
    try:
        MyHash[key].add(tweet)
    except KeyError:
        MyHash[key] = {tweet}
I don't want to call this a linear filter, because the load factor on the buckets will be greater than 1, but it is definitely faster than one big comparison over everything when the data is large.
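To actually filter with this structure, you can compute the same key for an incoming tweet and then search only its own bucket instead of the whole collection; a sketch under the same assumptions (the helper name is mine):

def is_exact_duplicate(text):
    # Recompute the bucketed letter-count key for the incoming tweet
    counts = {letter: 0 for letter in alphabetLetters}
    for ch in text:
        if ch.lower() in counts:
            counts[ch.lower()] += 1
    key = tuple(counts[letter] // 3 for letter in alphabetLetters)
    # Only the (small) bucket for this key needs to be searched
    return text in MyHash.get(key, set())

For near-duplicates rather than exact matches, you would run a similarity check over MyHash.get(key, set()) instead of over all stored tweets.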
Upvotes: 0
Reputation: 11100
You could make a set of the tweet texts:
setOfTweets = set(['this is a tweet #Twitter', 'this is another tweet.'])
print(setOfTweets)
# set(['this is another tweet.', 'this is a tweet #Twitter'])
setOfTweets.add('this is a new tweet')
setOfTweets.add('this is another tweet.')  # duplicate is not added
print(setOfTweets)
# set(['this is another tweet.', 'this is a new tweet', 'this is a tweet #Twitter'])
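In a streaming handler, the membership test makes the duplicate check a single line; a minimal sketch (the on_status signature follows the Tweepy listener convention from the other answer and is an assumption here):

seen_texts = set()

def on_status(self, status):
    # Exact-text deduplication: a set ignores repeated adds
    if status.text not in seen_texts:
        seen_texts.add(status.text)
        print(status.text)  # process the new tweet here
    return True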
Upvotes: -1