Reputation: 3909
I'm trying to extract all the images from an HTML document (downloaded from the web and stored in a scalar string), and I'm using the HTML::LinkExtractor module from CPAN.
I'm passing the same HTML, but getting different links extracted.
Question: why is that the case and how can I fix this?
Code:
my $LX = new HTML::LinkExtractor();
# print($_[0] . "\n\n"); <--- Prints the same HTML document every time
$LX->parse(\$_[0]);
for my $p ( @{$LX->links()} ){
    # Need to iterate through all the
    # values, since images can be hidden
    # in _TEXT w/o any img tag, etc.
    foreach (my( $key, $val ) = each $p) {
        print($key . "--->" . $val . "\n"); # <--- Prints different values
    }
}
First output:
$ ./HTMLPictureScraper.pl http://dustyfeet.com/
/--->/
/--->/
href--->http://dustyfeetonline.com
href--->http://dustyfeetonline.com
target--->_top
target--->_top
href--->http://www.nytimes.com/2006/08/28/technology/28link.html?scp=6&sq=%22stuart%20frankel%22&st=cse
href--->http://www.nytimes.com/2006/08/28/technology/28link.html?scp=6&sq=%22stuart%20frankel%22&st=cse
target--->_top
target--->_top
tag--->a
tag--->a
href--->./evil/evil.html
href--->./evil/evil.html
_TEXT---><a
href="./pangan/index.html">Warung Seniman</a>
_TEXT---><a
href="./pangan/index.html">Warung Seniman</a>
href--->./santanyi_registration.html
href--->./santanyi_registration.html
href--->mailto:[email protected]
href--->mailto:[email protected]
Second output:
$ ./HTMLPictureScraper.pl http://dustyfeet.com/
content--->1vLCRPR1SHmiCICnhWfD7jtpOOSHe79iILqzDkGBUg0=
content--->1vLCRPR1SHmiCICnhWfD7jtpOOSHe79iILqzDkGBUg0=
tag--->a
tag--->a
href--->notuncnj.html
href--->notuncnj.html
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
href--->mailto:[email protected]
href--->mailto:[email protected]
Upvotes: 0
Views: 95
Reputation: 57640
In this line, you are trying to combine an each-iterator with a for-each loop. Despite their similar names, those are incompatible:
foreach (my( $key, $val ) = each $p) {
print($key . "--->" . $val . "\n");
}
This gets the next key-value pair from the iterator of %$p and assigns it to the two-item list ($key, $val). The foreach then loops over these two items, running its body once for each, which is why you always see the same pair printed twice. Because each is only called once per hash and its iteration order is undefined, you only see one arbitrary entry of each %$p hash, and that entry can differ from run to run.
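The effect can be reproduced without any HTML at all. The following is only a minimal sketch: the %link hash and the buggy_lines helper are made up for illustration and merely mimic one record from the question's output.

```perl
use strict;
use warnings;

# Hypothetical link record, shaped like one entry of the question's output.
my %link = (tag => 'a', href => './evil/evil.html', target => '_top');

# Collect the lines that the mixed foreach/each construct would print.
sub buggy_lines {
    my ($p) = @_;
    my @lines;
    # each() runs only once; foreach then visits the two assigned
    # variables, so the body executes twice with the same $key and $val.
    foreach (my ($key, $val) = each %$p) {
        push @lines, "$key--->$val";
    }
    keys %$p;    # reset the hash's iterator for later callers
    return @lines;
}

print "$_\n" for buggy_lines(\%link);
```

Running this should print one pair twice, and which of the three pairs it is may change between runs.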
To fix this:
Either use a while loop with the each iterator:
while (my ($key, $val) = each %$p) {
print "$key--->$val\n";
}
Or, use a foreach loop over the keys:
for my $key (keys %$p) {
my $val = $p->{$key};
print "$key--->$val\n";
}
I prefer the for/foreach loop because this allows us to sort the keys in a stable order, instead of relying on the undefined iteration order of a hash:
for my $key (sort keys %$p) {
my $val = $p->{$key};
print "$key--->$val\n";
}
This should then always produce identical output for identical input documents.
As zdim noted in their answer, you should not pass scalars like $p to operators like keys or each, but should dereference them to a hash, as in each %$p. Otherwise, your code will not work on up-to-date versions of Perl: calling keys or each directly on a reference was an experimental feature that was removed in Perl 5.24.
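Putting both points together, here is a rough sketch of the corrected loop. The @links array is a hypothetical stand-in for @{$LX->links()}, assumed to be a list of hash references as the question's code implies, and link_lines is a helper name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for @{ $LX->links() }: an array of hash references.
my @links = (
    { tag => 'a', href => 'http://dustyfeetonline.com', target => '_top' },
    { tag => 'a', href => './evil/evil.html' },
);

# Return every key/value pair of one record, in sorted key order.
sub link_lines {
    my ($p) = @_;
    # %$p explicitly dereferences the hash reference before calling keys.
    return map { "$_--->$p->{$_}" } sort keys %$p;
}

for my $p (@links) {
    print "$_\n" for link_lines($p);
}
```

Each record is now printed in full, and the sorted keys keep the output stable across runs.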
Upvotes: 3