Reputation: 5565
So it seemed easy enough. Use a series of nested loops to go through a ton of URLs sorted by year/month/day and download the XML files. As this is my first script, I started with the loop, something familiar in any language. When I ran it just printing the constructed URLs, it worked perfectly.
I then wrote the code to download the content and save it separately, and that worked perfectly as well with a sample URL on multiple test cases.
But when I combined these two bits of code, it broke: the program just got stuck and did nothing at all.
I therefore ran the debugger, and as I stepped through it, it got stuck on this one line:
warnings::register::import(/usr/share/perl/5.10/warnings/register.pm:25):25:vec($warnings::Bits{$k}, $warnings::LAST_BIT, 1) = 0;
If I just hit r to return from the subroutine, it works and continues to another point on its way back down the call stack, where something similar happens over and over for some time. The stack trace:
warnings::register::import('warnings::register') called from file `/usr/lib/perl/5.10/Socket.pm' line 7
Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7
eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7
require 'Socket.pm' called from file `/usr/lib/perl/5.10/IO/Socket.pm' line 12
IO::Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7
eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7
require 'IO/Socket.pm' called from file `/usr/share/perl5/LWP/Simple.pm' line 158
LWP::Simple::_trivial_http_get('www.aDatabase.com', 80, '/sittings/1987/oct/20.xml') called from file `/usr/share/perl5/LWP/Simple.pm' line 136
LWP::Simple::_get('http://www.aDatabase.com/1987/oct/20.xml') called from file `xmlfetch.pl' line 28
As you can see, it is getting stuck inside this get($url) method, and I have no clue why. Here is my code:
#!/usr/bin/perl
use LWP::Simple;

$urlBase = 'http://www.aDatabase.com/subheading/';
$day     = 1;
$month   = 1;
@months  = ("list of months","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec");
$year    = 1987;
$nullXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<nil-classes type=\"array\"/>\n";

while($year <= 2006)
{
    $month = 1;
    while($month <= 12)
    {
        $day = 1;
        while($day <= 31)
        {
            $newUrl  = "$urlBase$year/$months[$month]/$day.xml";
            $content = get($newUrl);
            if($content ne $nullXML)
            {
                $filename = "$year-$month-$day.xml";
                open(FILE, ">$filename");
                print FILE $content;
                close(FILE);
            }
            $day++;
        }
        $month++;
    }
    $year++;
}
I am almost positive it is something tiny I just don't know, but Google has not turned up anything.
EDIT: It's official: it just hangs forever inside this get method. It runs for several loops, then hangs again for a while. But it's still a problem. Why is this happening?
Upvotes: 2
Views: 7255
Reputation: 132802
LWP has a getstore
function that does most of the fetching-then-saving work for you. You might also check out LWP::Parallel::UserAgent for a bit more control over how you hit the remote site.
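A minimal sketch of the getstore suggestion, using example.com and a made-up path in place of the question's real URL:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Hypothetical URL/file pair following the question's naming scheme.
my $url      = 'http://www.example.com/subheading/1987/oct/20.xml';
my $filename = '1987-10-20.xml';

# getstore() fetches the URL and writes the body straight to the
# file, returning the HTTP status code -- no manual open/print/close.
my $status = getstore($url, $filename);
print is_success($status) ? "saved $filename\n" : "failed: $status\n";
```

Note that getstore saves the body unconditionally, so the "null XML" check from the question would have to move to after the download (or use head() first).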
Upvotes: 0
Reputation: 414207
(2006 - 1986) * 12 * 31
is more than 7000. Requesting web pages without a pause is not nice.
Slightly more Perl-like version (code-style wise):
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

my $urlBase = 'http://www.example.com/subheading/';
my @months  = qw/jan feb mar apr may jun jul aug sep oct nov dec/;
my $nullXML = <<'NULLXML';
<?xml version="1.0" encoding="UTF-8"?>
<nil-classes type="array"/>
NULLXML

for my $year (1987..2006) {
    for my $month (0..$#months) {
        for my $day (1..31) {
            my $newUrl  = "$urlBase$year/$months[$month]/$day.xml";
            my $content = "abc"; #XXX get($newUrl);
            if ($content ne $nullXML) {
                my $filename = "$year-@{[$month+1]}-$day.xml";
                open my $fh, '>', $filename
                    or die "Can't open '$filename': $!";
                print $fh $content;
                # $fh implicitly closed at end of scope
            }
        }
    }
}
Upvotes: 2
Reputation: 6622
Since http://www.adatabase.com/1987/oct/20.xml is a 404 (and isn't something that can be generated from your program anyway, since there's no 'subheading' in the path), I'm assuming that isn't the real link you are using, which makes it hard for us to test. As a general rule, please use example.com instead of making up hostnames; that's why it is reserved.
You should really
use strict;
use warnings;
in your code - this will help highlight any scoping issues you may have (I'd be surprised if that were the case, but there is a chance that part of the LWP code is messing around with your $urlBase or something). It should be enough to put 'my' in front of the initial variable declarations (and $newUrl, $content, and $filename) to make your code strict-clean.
If using strict and warnings doesn't get you any closer to a solution, you could warn out the link you are about to use on each loop iteration, so when it sticks you can try that URL in a browser and see what happens. Alternatively, a packet sniffer (such as Wireshark) could give you some clues.
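The warn-each-link idea might look like this inside the question's loop (URL base is a placeholder; the actual fetch is commented out):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same URL construction as the question's loop. warn prints to STDERR
# immediately (unbuffered), so when the script sticks, the last line
# on the terminal is exactly the request that is hanging.
my @months  = qw/jan feb mar apr may jun jul aug sep oct nov dec/;
my $urlBase = 'http://www.example.com/subheading/';   # placeholder
for my $day (1 .. 3) {
    my $newUrl = "${urlBase}1987/$months[9]/$day.xml";
    warn "about to fetch $newUrl\n";
    # my $content = get($newUrl);   # the real fetch would go here
}
```

That last-printed URL can then be pasted into a browser or handed to Wireshark for a closer look.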
Upvotes: 3