Mindaugas Bernatavičius

Reputation: 3909

Perl HTML::LinkExtractor returns different links on different calls

I'm trying to extract all the images from an HTML document (downloaded from the web and stored in a scalar string), and I'm using the HTML::LinkExtractor module from CPAN.

I'm passing the same HTML, but getting different links extracted.

Question: why is that the case and how can I fix this?

Code:

my $LX = new HTML::LinkExtractor();
# print($_[0] . "\n\n"); <--- Prints the same HTML document every time
$LX->parse(\$_[0]);

for my $p ( @{$LX->links()} ){
    # Need to iterate through all the
    # values, since images can be hidden
    # in _TEXT w/o any img tag, etc.
    foreach (my( $key, $val ) = each $p) {
        print($key . "--->" . $val . "\n"); # <--- Prints different values
    }
}

First output:

$ ./HTMLPictureScraper.pl http://dustyfeet.com/
/--->/
/--->/
href--->http://dustyfeetonline.com
href--->http://dustyfeetonline.com
target--->_top
target--->_top
href--->http://www.nytimes.com/2006/08/28/technology/28link.html?scp=6&sq=%22stuart%20frankel%22&st=cse
href--->http://www.nytimes.com/2006/08/28/technology/28link.html?scp=6&sq=%22stuart%20frankel%22&st=cse
target--->_top
target--->_top
tag--->a
tag--->a
href--->./evil/evil.html
href--->./evil/evil.html
_TEXT---><a
 href="./pangan/index.html">Warung Seniman</a>
_TEXT---><a
 href="./pangan/index.html">Warung Seniman</a>
href--->./santanyi_registration.html
href--->./santanyi_registration.html
href--->mailto:[email protected]
href--->mailto:[email protected]

Second output:

$ ./HTMLPictureScraper.pl http://dustyfeet.com/
content--->1vLCRPR1SHmiCICnhWfD7jtpOOSHe79iILqzDkGBUg0=
content--->1vLCRPR1SHmiCICnhWfD7jtpOOSHe79iILqzDkGBUg0=
tag--->a
tag--->a
href--->notuncnj.html
href--->notuncnj.html
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
tag--->a
href--->mailto:[email protected]
href--->mailto:[email protected]

Upvotes: 0

Views: 95

Answers (1)

amon

Reputation: 57640

In this line, you are trying to combine an each-iterator with a for-each loop. Despite their similar names, those are incompatible:

foreach (my( $key, $val ) = each $p) {
    print($key . "--->" . $val . "\n");
}

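This gets the next key-value pair from %$p's iterator and assigns it to the two-item list ($key, $val). The foreach then loops over those two items, so the body runs twice and prints the same key and value both times. That's why every line of your output appears twice. And because each is called only once per hash, you only ever see a single entry of %$p; since the iteration order of each is undefined, which entry that is can change from run to run.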
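To see the doubling in isolation, here is a minimal sketch that uses a hypothetical one-entry hash in place of a real link record (the hash contents are an assumption, not data from the question). It dereferences with %$p so that it also runs on current versions of Perl:

```perl
use strict;
use warnings;

# Hypothetical single-entry hash standing in for one link record.
my $p = { href => 'index.html' };

my @lines;

# The list assignment is evaluated once. foreach then iterates over the
# two variables that were assigned ($key and $val), so the body runs
# twice and sees the same key-value pair both times.
foreach (my ($key, $val) = each %$p) {
    push @lines, "$key--->$val";
}

print "$_\n" for @lines;
```

With only one entry in the hash, the same line is collected twice, which mirrors the paired lines in the question's output.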

To fix this:

Either use a while loop to drive the each iterator:

while (my ($key, $val) = each %$p) {
    print "$key--->$val\n";
}

Or, use a foreach loop over the keys:

for my $key (keys %$p) {
    my $val = $p->{$key};
    print "$key--->$val\n";
}

I prefer the for/foreach loop because this allows us to sort the keys in a stable order, instead of relying on the undefined iteration order of a hash:

for my $key (sort keys %$p) {
    my $val = $p->{$key};
    print "$key--->$val\n";
}

This should then always produce identical output for identical input documents.
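As a self-contained sketch of the sorted approach, the following uses a hypothetical array of hash references in place of the value returned by $LX->links(). The sample records and the helper name format_links are assumptions made for illustration; they are not part of HTML::LinkExtractor:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the array reference returned by $LX->links():
# each element is a hash reference describing one link.
my $links = [
    { tag => 'a',   href => './evil/evil.html', target => '_top' },
    { tag => 'img', src  => 'logo.png' },
];

# Return one "key--->value" line per attribute of every link. Keys are
# sorted, so the order does not depend on Perl's hash iteration order.
sub format_links {
    my ($links) = @_;
    my @lines;
    for my $p (@$links) {
        for my $key (sort keys %$p) {
            push @lines, "$key--->$p->{$key}";
        }
    }
    return @lines;
}

print "$_\n" for format_links($links);
```

Because every key of every record is visited in sorted order, repeated runs over the same data should print the same lines in the same order.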

As zdim noted in their answer, you should not pass scalars like $p to operators like keys or each; dereference it to a hash instead, as in each %$p. Calling keys or each directly on a reference was an experimental feature that was removed in Perl 5.24, so the original code will not work on up-to-date versions of Perl.

Upvotes: 3
