Perl web scraper, retrieve data from text inside a script tag

Question

So far I have been using Perl to obtain data from web pages with HTML::TreeBuilder. This worked while the data was contained in meta or div tags, but I have now stumbled upon a structure that I don't know how to crawl, even though it looks pretty trivial.

The example displays the relevant part of the content that I get from the web. I would like to get the values for units and horsePower.

Fragments of the code I was using so far:

use strict;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder;

[...]

$reply = $ua->get($url, @ns_headers);

# printing the reply would get us the first code snippet.
print $reply->content;

unless ($reply->is_success) {
    [...]
}

my $tree = HTML::TreeBuilder->new_from_content($reply->content);
my @unit_array = $tree->look_down(_tag => 'meta', itemprop => 'unit');
my $unit = $unit_array[0]->attr('content');

[...]

Does anyone know how to obtain the relevant data, and whether I should use something other than HTML::TreeBuilder for this? I haven't found any similar cases searching Stack Overflow and the web.

Stefan Becker · Accepted Answer

You are basically on the right path, but HTML::TreeBuilder doesn't understand anything about JavaScript, so the content of the script element has to be handled as plain text.

The approach:

find the <script> element

Test run:
```
$ perl dummy.pl
$VAR1 = {
          'data' => {
                      'horsePower' => '100',
                      'units' => 'kW'
                    }
        };
FOUND: units: kW horsepower: 100
```
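
The code that produced this run is not reproduced here, so the following is only a minimal sketch of the approach under stated assumptions. The sample HTML, the variable name `pageData`, and the helper `extract_script_data` are hypothetical; the real page may embed its data differently. It locates each script element with HTML::TreeBuilder, takes its raw text, cuts out the outermost `{...}` span, and decodes that with the core JSON::PP module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use JSON::PP;

# Hypothetical page content: the real page is not shown, so this layout
# (a JSON object assigned to a JavaScript variable) is an assumption.
my $html = <<'HTML';
<html>
  <head>
    <script type="text/javascript">
      var pageData = {"data": {"units": "kW", "horsePower": "100"}};
    </script>
  </head>
  <body><div>Nothing useful here</div></body>
</html>
HTML

# Return the decoded structure from the first <script> element whose text
# contains a parsable JSON object, or undef if there is none.
sub extract_script_data {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my $result;

    for my $script ($tree->look_down(_tag => 'script')) {
        # as_text() skips <script> content, so collect the raw text children.
        my $js = join '', grep { !ref } $script->content_list;

        # Assumption: everything from the first "{" to the last "}" is JSON.
        next unless $js =~ /(\{.*\})/s;
        $result = eval { decode_json($1) };
        last if $result;
    }

    $tree->delete;    # release the parse tree
    return $result;
}

my $decoded = extract_script_data($html);
if ($decoded && ref($decoded->{data}) eq 'HASH') {
    my $data = $decoded->{data};
    print "FOUND: units: $data->{units} horsepower: $data->{horsePower}\n";
}
```

The greedy regular expression only works when the embedded object is valid JSON (double-quoted keys, no trailing commas). If the page uses a looser JavaScript object literal, the text would need to be normalized first or parsed with a more tolerant decoder.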
