Reputation: 1215
Friends,
I have a nice script that works as an image scraper; for the first trials and tests, all goes well.
Here is the list of URLs from urls.txt that I am running against the script. Note this is only a short list: I need to run against 2500 URLs, so it would be great if the script were a bit more robust and continued running when some URLs are unavailable or take too long to respond. I think the script runs into problems when a URL is unavailable, responds too slowly, or blocks mozrepl, so that WWW::Mechanize::Firefox takes too much time.
Do you think these are the likely causes of the issue? If so, how can we improve the script and make it stronger and more robust, so that it does not stop too soon?
Love to hear from you.
Greetings.
http://www.bez-zofingen.ch
http://www.schulesins.ch
http://www.schulen-turgi.ch/pages/bezirksschule/startseite.php
http://www.schinznach-dorf.ch
http://www.schule-seengen.ch
http://www.gilgenberg.ch/schule/bez/2005-06/
http://www.rheinfelden-schulen.ch/bezirksschule/
http://www.bezmuri.ch
http://www.moehlin.ch/schulen/
http://www.schule-mewo.ch
http://www.bez-frick.ch
http://www.bezendingen.ch
http://www.bezbrugg.ch
http://www.schule-bremgarten.ch/content/view/20/37/
http://www.bez-balsthal.ch
http://www.schule-baden.ch
http://bezaarau.educanet2.ch/info/.ws_gen/index.htm
http://www.benedict-basel.ch
http://www.institut-beatenberg.ch/
http://www.schulewilchingen.ch
http://www.ksuo.ch
http://www.international-school.ch
http://www.vsgtaegerwilen.ch/
http://www.vgk.ch/
http://www.vstb.ch
But I guess I would be very happy if it were more robust than it is now.
Sure thing, it is driving a real browser, as WWW::Mechanize::Firefox does, so it may well be somewhat less stable than other screen-scraping solutions. I am getting errors like the following (see below).
Note: I also had a closer look at the Firefox Troubleshooting debugging pages, with their hints, tricks, and workarounds for various bugs and the like.
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open my $URLs, '<', 'URLs.txt' or die $!;
while (<$URLs>) {
    chomp;
    next unless /^http/i;    # skip lines that are not URLs
    print "$_\n";
    $mech->get($_);
    my $png  = $mech->content_as_png;
    my $name = $_;
    $name =~ s#^http://##i;  # strip the scheme
    $name =~ s#/##g;         # remove slashes to get a valid file name
    $name =~ s/\s+\z//;
    $name =~ s/\A\s+//;
    $name =~ s/^www\.//;
    $name .= ".png";
    open my $out, '>', "/home/martin/images/$name" or die $!;
    binmode $out;
    print $out $png;
    close $out;
    sleep 5;
}
See the results and also the errors where it stops ("Datei oder Verzeichnis nicht gefunden" is German for "No such file or directory"):
martin@linux-wyee:~/perl> perl test_10.pl
http://www.bez-zofingen.ch
Datei oder Verzeichnis nicht gefunden at test_10.pl line 24, <$URLs> line 3.
martin@linux-wyee:~/perl> perl test_10.pl
http://www.bez-zofingen.ch
http://www.schulesins.ch
http://www.schulen-turgi.ch/pages/bezirksschule/startseite.php
http://www.schinznach-dorf.ch
http://www.schule-seengen.ch
http://www.gilgenberg.ch/schule/bez/2005-06/
http://www.rheinfelden-schulen.ch/bezirksschule/
Not Found at test_10.pl line 15
martin@linux-wyee:~/perl>
What do you suggest? How can we make the script a bit more robust, so that it does not stop so early?
Upvotes: 0
Views: 317
Reputation: 51
You should always check whether the response was successful or not. Here is your corrected code:
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open my $URLs, '<', 'URLs.txt' or die $!;
while (<$URLs>) {
    chomp;
    next unless /^http/i;
    print "$_\n";
    my $res = $mech->get($_);
    if ( !$res->is_success() ) {
        next;    # skip this URL and go on with the next one
    }
    my $png  = $mech->content_as_png;
    my $name = $_;
    $name =~ s#^http://##i;
    $name =~ s#/##g;
    $name =~ s/\s+\z//;
    $name =~ s/\A\s+//;
    $name =~ s/^www\.//;
    $name .= ".png";
    open my $out, '>', "/home/martin/images/$name" or die $!;
    binmode $out;
    print $out $png;
    close $out;
    sleep 5;
}
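One caveat: the question's own output ("Not Found at test_10.pl line 15") suggests that get() can die outright rather than return an unsuccessful response, in which case the is_success() check is never reached. A small hedged variant that guards the call with eval as well, assuming $res behaves like an HTTP::Response (status_line is a standard HTTP::Response method):
# Guard get() itself with eval, so a URL that makes it die
# does not abort the whole loop.
my $res = eval { $mech->get($_) };
unless ( $res && $res->is_success ) {
    my $reason = $@ || ( $res ? $res->status_line : 'no response' );
    warn "skipping $_: $reason\n";
    next;
}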
Upvotes: 0
Reputation: 39158
Wrap all method and system calls that can go wrong in an exception handler. (See chapter 13 of Perl Best Practices for a discussion of the topic.) Set explicit timeouts for MozRepl.
When you get an error, log it and skip ahead to the next URL. When the run is done, inspect the log file and repeat the run with the URLs that could not be handled the first time. Sort out the URLs of pages that are permanently down. Finally, a few pages might remain that cannot be screenshotted through MozRepl for some reason; handle those manually.
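A minimal sketch of that approach, assuming the same URLs.txt input and image directory as in the question. The alarm-based timeout is a generic per-URL guard, not a MozRepl-specific setting (alarm may not cleanly interrupt every blocking call inside MozRepl, so treat it as a safety net), and the failed_urls.log file name is made up for illustration:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open my $URLs, '<', 'URLs.txt'        or die $!;
open my $log,  '>', 'failed_urls.log' or die $!;   # hypothetical log file

while ( my $url = <$URLs> ) {
    chomp $url;
    next unless $url =~ /^http/i;

    # Wrap everything that can fail in eval and enforce a hard
    # per-URL timeout with alarm, so one bad URL can neither
    # stall nor abort the whole run.
    my $png = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm 60;                    # 60 s per URL; adjust to taste
        $mech->get($url);
        my $shot = $mech->content_as_png;
        alarm 0;
        $shot;
    };
    alarm 0;                         # make sure no alarm is left pending

    if ( !defined $png ) {
        print $log "$url\t$@";       # log the failure and move on
        next;
    }

    ( my $name = $url ) =~ s#^http://(?:www\.)?##i;
    $name =~ s#/##g;
    open my $out, '>', "/home/martin/images/$name.png" or do {
        print $log "$url\t$!\n";
        next;
    };
    binmode $out;
    print $out $png;
    close $out;
    sleep 5;
}
close $log;
Afterwards, failed_urls.log can serve directly as the input list for a second pass over only the URLs that failed the first time.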
Upvotes: 3