pOverlord

Reputation: 33

Perl web scraper, retrieve data from text inside a script tag

So far I have been using Perl with HTML::TreeBuilder to obtain data from web pages. This worked while the data was contained in meta or div tags, but I have now come across a structure that I don't know how to parse, even though it looks pretty trivial.

<html lang="en">
    <body>
        <script type="text/javascript">
            panel.web.bootstrapData = {
                "data": {
                    "units": "kW",
                    "horsePower": 100.00
                }
            };
        </script>
    </body>
</html>

The example shows the relevant part of the content returned by the server. I would like to extract the values of units and horsePower.

Fragments of the code I was using so far:

use strict;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder;

[...]

$reply = $ua->get($url, @ns_headers);

# Printing the reply gives the HTML snippet shown above.
print $reply->content;

unless ($reply->is_success) {
    [...]
}

my $tree = HTML::TreeBuilder->new_from_content($reply->content);
my @unit_array = $tree->look_down(_tag => 'meta', itemprop => 'unit');
my $unit = $unit_array[0]->attr('content');

[...]

Does anyone know how to obtain the relevant data, and whether I should use something other than HTML::TreeBuilder for this? I haven't found any similar cases searching Stack Overflow and the web.

Upvotes: 3

Views: 555

Answers (1)

Stefan Becker

Reputation: 5962

You are basically on the right track. However, HTML::TreeBuilder only parses the HTML; it does not understand JavaScript, so it hands you the body of a <script> element as one plain string.

The approach:

  • find the <script> nodes
  • extract the JSON content from those nodes
    • NOTE: this is easy for the example given, but more complicated <script> content will need more finesse
    • The escaped \; in the regex isn't strictly required, but the SO syntax highlighter gets confused without it
  • use JSON to decode the string to Perl data structures
  • access those data structures in your script

Here is a first rough solution without error checking. I left some debugging lines, commented out, in the code so that you can trace what each step is doing:

#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;
use HTML::TreeBuilder;
use JSON;

my $decoder = JSON->new;

my $tree       = HTML::TreeBuilder->new_from_file(\*DATA);
#$tree->dump;
my @scripts    = $tree->look_down(_tag => 'script');
#$scripts[0]->dump;
# NOTE 1: ->as_text() *DOES NOT* return <script> content!
# NOTE 2: ->as_HTML() probably doesn't work in all cases, e.g. because of escaping
my $javascript = ($scripts[0]->content_list())[0];
#print "${javascript}\n";
my($json)      = $javascript =~ /(\{.+\})\;/s;
#print "${json}\n";
my $object     = $decoder->decode($json);

print Dumper($object);
print "FOUND: units: ", $object->{data}->{units},
      " horsepower: ",  $object->{data}->{horsePower}, "\n";

# IMPORTANT: $tree needs to be destroyed by hand when you're done with it!
$tree->delete;

exit 0;

__DATA__
<html lang="en">
    <body>
        <script type="text/javascript">
            panel.web.bootstrapData = {
                "data": {
                    "units": "kW",
                    "horsePower": 100.00
                }
            };
        </script>
    </body>
</html>

Test run:

$ perl dummy.pl
$VAR1 = {
          'data' => {
                      'horsePower' => '100',
                      'units' => 'kW'
                    }
        };
FOUND: units: kW horsepower: 100
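
As noted in the list above, the regex needs more finesse once the <script> content gets more complicated: it is greedy, so it captures everything from the first { to the last }; in the script. The following is only a sketch of one way around that, not a definitive implementation. It assumes the data is always assigned to a known variable name and that the right-hand side is strict JSON. The helper name extract_assigned_json and the extra var lines in the sample are made up for illustration. It uses decode_prefix from JSON::PP, which parses one JSON value and ignores whatever follows it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use JSON::PP;    # core module since Perl 5.14

# Hypothetical helper (not part of the script above): given the text of a
# <script> element and the name of the JavaScript variable being assigned,
# return the decoded JSON object, or undef if nothing usable is found.
sub extract_assigned_json {
    my ($javascript, $variable) = @_;
    return unless defined $javascript;

    # Find "<variable> = {" and leave pos() just before the opening brace.
    return unless $javascript =~ /\Q$variable\E\s*=\s*(?=\{)/g;
    my $candidate = substr($javascript, pos($javascript));

    # decode_prefix() parses a single JSON value and stops there, so the
    # trailing ";" and any later JavaScript do no harm.
    my ($object) = eval { JSON::PP->new->decode_prefix($candidate) };
    return $object;    # undef if decoding died
}

# Sample input: the surrounding "var" lines are invented to show a case
# where the greedy regex would capture too much.
my $javascript = <<'JS';
    var other = { not: "json" };
    panel.web.bootstrapData = {
        "data": { "units": "kW", "horsePower": 100.00 }
    };
    var later = {};
JS

my $object = extract_assigned_json($javascript, 'panel.web.bootstrapData');
if ($object) {
    print "FOUND: units: $object->{data}{units}",
          " horsepower: $object->{data}{horsePower}\n";
}
else {
    warn "bootstrapData not found or not valid JSON\n";
}
```

In the script above, $javascript would come from content_list() instead of the here-document. Looping over @scripts and taking the first defined result would also cover pages with several <script> elements.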

Upvotes: 2
