I built a Twitter crawler using the Streaming API. It is written in Perl with Net::Twitter::Stream. I think it is retrieving too few tweets: I left it tracking TV-series tweets last night and got only 30,860, which seems low to me. What do you think? Is there another Perl library I could use? I'll paste part of my code here in case there is something wrong with it.
Thanks everybody
Thiago
use strict;
use warnings;
use JSON;
use Solr;
use Net::Twitter::Stream;

my ($username, $password, $track);   # credentials and the search string are set here
my $count = 0;                       # reconnection attempts, used for the backoff below

sub coletar {
    # new() connects and blocks, dispatching each incoming tweet to got_tweet().
    Net::Twitter::Stream->new(
        user                 => $username,
        pass                 => $password,
        callback             => \&got_tweet,
        connection_closed_cb => \&connection_closed,
        track                => $track,   # $track has my search string
    );
}

sub connection_closed {
    # Back off a little longer on each successive reconnect.
    if    ($count == 0) { sleep 10;  }
    elsif ($count == 1) { sleep 20;  }
    else                { sleep 240; }
    $count++;
    warn "Connection to Twitter closed";
    coletar();   # Restart the tweet download.
}

sub got_tweet {
    my ( $tweet, $json ) = @_;
    $count = 0;   # a tweet arrived, so reset the backoff counter
    # Here I save the tweet in my NoSQL database...
}

coletar();
Upvotes: 2
Views: 654
There are a few things to keep in mind about the Twitter stream. First of all, if your code is a bottleneck, the queue of incoming tweets will overflow and your connection will be dropped, so the low count may well not be your code's fault.
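If you do suspect your callback (the Solr/NoSQL write) is the slow part, one way to rule that out is to make got_tweet do almost nothing and write to storage in batches. A minimal sketch; save_batch() is a stand-in for whatever bulk-insert call your storage client actually provides:

use constant FLUSH_SIZE => 500;
my @buffer;

sub got_tweet {
    my ($tweet, $json) = @_;
    push @buffer, $json;                     # cheap: just keep the raw JSON
    flush_buffer() if @buffer >= FLUSH_SIZE; # do the expensive write in batches
}

sub flush_buffer {
    return unless @buffer;
    save_batch(\@buffer);   # hypothetical helper -- replace with your store's bulk insert
    @buffer = ();
}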
Twitter also limits the number of tweets it gives to "regular" consumers of the streaming API (reportedly around 1% of the full firehose) and charges for the full stream, so you may simply be hitting that cap, but it's hard to be sure.
One way to test what percentage of tweets you are getting is to set up a second account that randomly sends tweets that should be caught by your filter, then count how many of those tweets your scraper actually captures. If it is not 100%, you are probably being rate limited.
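The counting itself is straightforward once you have the IDs of the test tweets you posted and the IDs your crawler stored; both lists below are placeholders you would fill in yourself:

# Fill these two lists yourself: the IDs posted from the test account,
# and the IDs your crawler actually stored.
my @sent_ids     = qw( );
my @captured_ids = qw( );

my %captured = map { $_ => 1 } @captured_ids;
my $hits     = grep { $captured{$_} } @sent_ids;   # how many test tweets were caught
printf "Caught %d of %d test tweets (%.1f%%)\n",
       $hits, scalar @sent_ids, @sent_ids ? 100 * $hits / @sent_ids : 0;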
Upvotes: 2