I've written a Perl script that will fetch and parse a webpage, fill in some forms and collect some information, but after a while I was denied by the server with HTTP error 429 Too Many Requests. I sent too many requests in a short amount of time to the server, so my IP has been blacklisted.

How could I "slow down" my requests/script to avoid this again and not hurt anyone? Is there any way to do this with the Perl module WWW::Mechanize?
    sub getlinksofall {
        # Follow each numbered results page and collect the item links
        for my $i ( 1 .. $maxpages ) {
            $mech->follow_link( url_regex => qr/page$i/i );
            push @LINKS, $mech->find_all_links(
                url_regex => qr/http:\/\/www\.example\.com\/somestuffs\//i
            );
        }
        foreach my $links (@LINKS) {
            push @LINKS2, $links->url();
        }
        @new_stuffs = uniq @LINKS2;
    }
    sub getnumberofpages {
        # Find the highest page number referenced in the current page
        push @numberofpages, $mech->content =~ m/\/page(\d+)"/gi;
        $maxpages = ( sort { $b <=> $a } @numberofpages )[0];
    }
    sub getdataabout {
        foreach my $stuff ( @new_stuffs ) {
            $mech->get($stuff);
            $g = $mech->content;
            $t = $mech->content;
            $s = $mech->content;
            # ... and then some regex matching with some DBI stuff ...
        }
    }
With these loops there could be thousands of links to fetch, and I just want to slow the script down. Would a simple sleep in these loops be enough for this?
You need to check whether the site you are scraping has a service agreement that allows you to use it in this way. Because bandwidth costs money, most sites prefer to restrict access to real human operators or legitimate index engines like Google.

You should also take a look at the robots.txt file for the site you're leeching, which will have details on exactly what automated access is permitted. Take a look at www.robotstxt.org for more information.
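If you want to check those rules from your script, the WWW::RobotRules module (which LWP::RobotUA uses internally) can parse a site's robots.txt and tell you whether a given URL may be fetched. A minimal sketch, with a placeholder agent name and the example.com URLs from your code:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    # The agent name here is a placeholder; robots.txt rules are matched against it
    my $rules = WWW::RobotRules->new('MyScraper/0.1');

    my $robots_url = 'http://www.example.com/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse( $robots_url, $robots_txt ) if defined $robots_txt;

    # Ask before fetching a page with WWW::Mechanize
    my $target = 'http://www.example.com/somestuffs/page1';
    print $rules->allowed($target)
        ? "allowed: $target\n"
        : "disallowed by robots.txt: $target\n";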
A simple sleep 30 between requests will probably be okay to get you past most rules, but don't reduce the period below 30 seconds.
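For example, a sleep at the end of the loop in your getdataabout sub would throttle the per-item fetches. A minimal sketch (30 seconds is just the figure suggested above):

    sub getdataabout {
        foreach my $stuff ( @new_stuffs ) {
            $mech->get($stuff);
            my $content = $mech->content;
            # ... regex matching and DBI work as before ...

            # Pause before the next request so the server isn't hammered
            sleep 30;
        }
    }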
There is also a subclass of LWP::UserAgent called LWP::RobotUA that is intended for situations like this. It may well be straightforward to get WWW::Mechanize to use this instead of the base class.
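For reference, LWP::RobotUA on its own looks roughly like this; it fetches and honours robots.txt automatically and enforces a minimum delay per host. This is only a sketch: the agent name and contact address are placeholders, and note that the delay is specified in minutes, not seconds.

    use strict;
    use warnings;
    use LWP::RobotUA;

    # 'from' should be a real contact address so the site can reach you
    my $ua = LWP::RobotUA->new(
        agent => 'MyScraper/0.1',
        from  => 'me@example.com',
    );

    # Minimum wait between requests to the same host: 0.5 minutes
    $ua->delay( 30 / 60 );

    # robots.txt is fetched and honoured automatically
    my $response = $ua->get('http://www.example.com/somestuffs/');
    print $response->status_line, "\n";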