Reputation: 189
I would like to build a data set of tweets on a certain keyword, using the Twitter Streaming API and Tweepy module for Python.
So far so good, but does anyone know how to receive tweets that are exactly the same (mostly retweets) just once? For my data analysis it isn't really useful to receive the same tweet multiple times.
Is there maybe a filter that removes tweets that were already downloaded to the data set?
Upvotes: 1
Views: 1838
Reputation: 5308
There are 2 cases here:
1) The tweets match exactly.
2) The tweets are nearly the same.
For both cases, this is what I do (you can pick your own similarity_threshold):
from difflib import SequenceMatcher

similarity_threshold = 0.7

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

latest_tweets = []  # texts of tweets seen so far

def on_status(self, data):
    # Find a stored tweet that is similar enough to count as a duplicate
    duplicate_tweet = next((t for t in latest_tweets if similarity(data.text, t) > similarity_threshold), None)
    if duplicate_tweet is None:
        # This is a new tweet, so remember its text
        latest_tweets.append(data.text)
    return True
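For context, here is a minimal sketch of how this check could sit inside a Tweepy stream listener (assuming Tweepy 3.x, where tweepy.StreamListener and on_status exist; the class name and the deque cap are my own choices, and capping the history keeps the linear scan bounded):

import tweepy
from collections import deque
from difflib import SequenceMatcher

similarity_threshold = 0.7

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

class DedupListener(tweepy.StreamListener):  # hypothetical class name
    def __init__(self, api=None):
        super().__init__(api)
        # Keep only the most recent 1000 texts so each comparison stays cheap
        self.latest_tweets = deque(maxlen=1000)

    def on_status(self, status):
        duplicate = next((t for t in self.latest_tweets
                          if similarity(status.text, t) > similarity_threshold), None)
        if duplicate is None:
            self.latest_tweets.append(status.text)
            print(status.text)  # process the new tweet here
        return True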
Upvotes: 1
Reputation: 11100
If you are finding that the runtime is unacceptable for data of a given size, then it is time to do something better. Some ad-hoc hashing might be to iterate through the batch that you fetch and store it in a dictionary of sets, where the keys are the count of each letter divided by some bucket size. This partitions your tweets into more reasonably sized sets and reduces the linear-time comparisons by a constant factor that depends on your bucket size.
Defining your hash vector determines the behaviour of the resulting data object. For example, if you only use alphabet characters, then clones with extra quotation marks and emojis will likely land in the same bucket, given a large enough bucket size. On the other hand, if you hashed on the number of different digits in the tweet, you probably wouldn't see much effect.
import string

setOfTweets = ['this is a tweet #Twitter', 'this is another tweet.']
alphabetLetters = string.ascii_lowercase  # 'a' through 'z'

MyHash = {}  # not actually a pythonic hash: a dict of sets keyed by bucketed letter counts
for tweet in setOfTweets:
    counts = {letter: 0 for letter in alphabetLetters}
    for ch in tweet:
        try:
            counts[ch.lower()] += 1
        except KeyError:
            pass  # ignore characters outside a-z
    # Integer-divide each count by the bucket size (3) so near-duplicates share a key
    key = tuple(counts[letter] // 3 for letter in alphabetLetters)
    try:
        MyHash[key].add(tweet)
    except KeyError:
        MyHash[key] = {tweet}
I don't want to call this a linear filter, because the load factor on the buckets will be greater than 1, but it is definitely faster than one big comparison over everything when the data is large.
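To actually filter with this structure, you can compute the same key for an incoming tweet and then search only its own bucket instead of the whole collection; a sketch under the same assumptions (the helper name is mine):

def is_exact_duplicate(text):
    # Recompute the bucketed letter-count key for the incoming tweet
    counts = {letter: 0 for letter in alphabetLetters}
    for ch in text:
        if ch.lower() in counts:
            counts[ch.lower()] += 1
    key = tuple(counts[letter] // 3 for letter in alphabetLetters)
    # Only the (small) bucket for this key needs to be searched
    return text in MyHash.get(key, set())

For near-duplicates rather than exact matches, you would run a similarity check over MyHash.get(key, set()) instead of over all stored tweets.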
Upvotes: 0
Reputation: 11100
You could make a set of the tweet texts:
setOfTweets = set(['this is a tweet #Twitter', 'this is another tweet.'])
print(setOfTweets)
# set(['this is another tweet.', 'this is a tweet #Twitter'])
setOfTweets.add('this is a new tweet')
setOfTweets.add('this is another tweet.')  # duplicate is not added
print(setOfTweets)
# set(['this is another tweet.', 'this is a new tweet', 'this is a tweet #Twitter'])
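In a streaming handler, the membership test makes the duplicate check a single line; a minimal sketch (the on_status signature follows the Tweepy listener convention from the other answer and is an assumption here):

seen_texts = set()

def on_status(self, status):
    # Exact-text deduplication: a set ignores repeated adds
    if status.text not in seen_texts:
        seen_texts.add(status.text)
        print(status.text)  # process the new tweet here
    return True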
Upvotes: -1