guyfromnowhere

Reputation: 7

Perl WWW::Mechanize Slow down requests to avoid HTTP Code 429

I've written a Perl script that fetches and parses a webpage, fills in some forms and collects some information, but after a while the server denied me with HTTP error 429 Too Many Requests. I had sent too many requests to the server in a short amount of time, so my IP was blacklisted.

How could I "slow down" my requests/script to avoid this happening again without hurting anyone? Is there any way to do this with the Perl module WWW::Mechanize?

sub getlinksofall {

    # Walk every results page and collect the link objects that
    # point at the items of interest
    for my $i ( 1 .. $maxpages ) {

        $mech->follow_link( url_regex => qr/page$i/i );
        push @LINKS, $mech->find_all_links(
            url_regex => qr/http:\/\/www\.example\.com\/somestuffs\//i
        );
    }

    # Reduce the link objects to plain URLs and drop duplicates
    foreach my $link (@LINKS) {
        push @LINKS2, $link->url();
    }

    @new_stuffs = uniq @LINKS2;    # uniq from List::Util or List::MoreUtils
}

sub getnumberofpages {
    # Capture every "/pageN" number in the current page and keep the largest
    push @numberofpages, $mech->content =~ m/\/page(\d+)"/gi;
    $maxpages = ( sort { $b <=> $a } @numberofpages )[0];
}

sub getdataabout {

    # Fetch each collected URL and extract the fields we need
    foreach my $stuff ( @new_stuffs ) {

        $mech->get($stuff);

        $g = $mech->content;
        $t = $mech->content;
        $s = $mech->content;

        # ... and then some regex matching and some DBI stuff...
    }
}

With these loops there could be thousands of links, and I just want to slow things down. Would a "sleep" call inside these loops be enough for this?

Upvotes: 1

Views: 356

Answers (1)

Borodin

Reputation: 126722

You need to check whether the site you are scraping has a service agreement that allows you to use it in this way. Because bandwidth costs money, most sites prefer to restrict access to real human operators and legitimate indexing engines like Google.

You should also check the robots.txt file for the site you're leeching, which will have details on exactly what automated access is permitted. Take a look at www.robotstxt.org for more information.
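If you want to consult those rules from Perl, one option is the WWW::RobotRules module. Here is a minimal sketch, assuming a made-up bot name and the example.com URLs from your question:

use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

# Hypothetical bot name -- substitute whatever your script identifies itself as
my $rules = WWW::RobotRules->new('MyExampleBot/1.0');

# Fetch and parse the site's robots.txt
my $robots_url = 'http://www.example.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse( $robots_url, $robots_txt ) if defined $robots_txt;

# Only fetch a page if the rules allow it
my $page = 'http://www.example.com/somestuffs/page1';
print $rules->allowed($page) ? "allowed\n" : "disallowed\n";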

A simple sleep 30 between requests will probably be okay to get you past most rules, but don't reduce the period below 30 seconds.
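For example, the pause could go at the end of the loop in your getdataabout subroutine; this is just a sketch of where the sleep would sit, using the variables from your question:

sub getdataabout {

    foreach my $stuff ( @new_stuffs ) {

        $mech->get($stuff);

        # ... regex matching and DBI work from the question ...

        sleep 30;    # wait 30 seconds before fetching the next page
    }
}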

There is also a subclass of LWP::UserAgent called LWP::RobotUA that is intended for situations like this. It may well be straightforward to get WWW::Mechanize to use this instead of the base class.
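On its own, LWP::RobotUA obeys robots.txt and enforces a minimum delay between requests. A minimal sketch, assuming a made-up agent name and contact address (both of which LWP::RobotUA requires):

use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical agent name and contact address -- use your own
my $ua = LWP::RobotUA->new(
    agent => 'MyExampleBot/1.0',
    from  => 'me@example.com',
);

# delay() is specified in minutes, so 0.5 is the 30 seconds suggested above
$ua->delay(0.5);

my $response = $ua->get('http://www.example.com/somestuffs/page1');
print $response->status_line, "\n";

How to wire this into WWW::Mechanize is not shown here; the point is that the robots.txt handling and the request throttling come for free with this class.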

Upvotes: 2
