StevieD

Reputation: 7443

Web::Scraper Cannot find <link> or <meta> elements in the <body> of an HTML document

I've been staring at this for an hour now and I'm throwing in the towel.

I am attempting to scrape some data from a web page. Here's a snippet with some of the data I'm trying to extract:

<span itemprop="thumbnail" itemscope itemtype="http://schema.org/ImageObject">
  <link itemprop="url" href="http://blahblah.org/video/thumbnail_23432230.jpg">
  <meta itemprop="width" content="1280">
  <meta itemprop="height" content="720">
</span>

I want to grab the value of the href attribute from the <link> tag with the Web::Scraper module. Here's the relevant Perl code:

use Web::Scraper;

# $html holds the page source containing the snippet above
my $div = scraper {
  process 'span[itemprop="thumbnail"] > link', url => '@href';
};
my $res = $div->scrape( $html );
my $url = $res->{url};

No matter what I try, $url is undefined. I'm using version 0.36 of the Web::Scraper module.

Upvotes: 2

Views: 276

Answers (1)

Borodin

Reputation: 126742

This is because of a bug in HTML::TreeBuilder::XPath. It has a naive understanding of <link> and <meta> elements, insisting that they belong only in the <head> element, even if they have an itemprop attribute.

The way elements are treated is based on the hashes in HTML::Tagset, and a fix of sorts can be effected by hacking this data.

If you add this to the top of your program

use HTML::Tagset;

for (qw/ link meta /) {
    $HTML::Tagset::isHeadElement{$_}       = 0;
    $HTML::Tagset::isHeadOrBodyElement{$_} = 1;
}

then it "fixes" the specific situation in your question, but of course a proper solution should take account of the itemprop attributes as well as the tags.
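
For reference, here is a minimal end-to-end sketch that combines the workaround with the scraper from the question. The inline $html document and the extra width rule are assumptions added for illustration; only the url rule comes from the question. Note that the HTML::Tagset hashes are global, so the change affects every HTML::TreeBuilder parse in the same process.

```perl
use strict;
use warnings;

use HTML::Tagset;
use Web::Scraper;

# Workaround: stop the parser from treating <link> and <meta> as
# head-only elements, so they stay where they appear inside <body>.
for my $tag (qw/ link meta /) {
    $HTML::Tagset::isHeadElement{$tag}       = 0;
    $HTML::Tagset::isHeadOrBodyElement{$tag} = 1;
}

# Hypothetical test document wrapping the snippet from the question.
my $html = <<'END_HTML';
<html><head><title>demo</title></head><body>
<span itemprop="thumbnail" itemscope itemtype="http://schema.org/ImageObject">
  <link itemprop="url" href="http://blahblah.org/video/thumbnail_23432230.jpg">
  <meta itemprop="width" content="1280">
  <meta itemprop="height" content="720">
</span>
</body></html>
END_HTML

my $thumb = scraper {
    # Rule from the question
    process 'span[itemprop="thumbnail"] > link', url => '@href';
    # Illustrative extra rule for one of the <meta> elements
    process 'span[itemprop="thumbnail"] > meta[itemprop="width"]',
        width => '@content';
};

my $res = $thumb->scrape($html);

print 'url:   ', ( $res->{url}   // '(undefined)' ), "\n";
print 'width: ', ( $res->{width} // '(undefined)' ), "\n";
```

If either value still prints as (undefined), check that the Tagset loop runs before the first call to scrape.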

Upvotes: 7
