Reputation: 33
So far I have been using Perl to obtain data from web pages with HTML::TreeBuilder. This worked when the data was contained inside meta or div tags, but now I have stumbled upon a new structure that I don't know how to crawl, though it looks pretty trivial.
<html lang="en">
  <body>
    <script type="text/javascript">
      panel.web.bootstrapData = {
        "data": {
          "units": "kW",
          "horsePower": 100.00
        }
      };
    </script>
  </body>
</html>
The example shows the relevant part of the content that I get from the web. I would like to get the values for units and horsePower.
Fragments of the code I was using so far:
use strict;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder;
[...]
$reply = $ua->get($url, @ns_headers);
# printing the reply would get us the first code snippet.
print $reply->content;
unless ($reply->is_success) {
[...]
}
my $tree = HTML::TreeBuilder->new_from_content($reply->content);
my @unit_array = $tree -> look_down(_tag=>'meta','itemprop'=>'unit');
my $unit = $unit_array[0]->attr('content');
[...]
Does anyone know how to obtain the relevant data, and whether I should use something other than HTML::TreeBuilder for this? I haven't found any similar cases searching through Stack Overflow and the web.
Upvotes: 3
Views: 555
Reputation: 5962
You are basically on the right path. But HTML::TreeBuilder doesn't understand anything about JavaScript.
The approach:
1. find the <script> nodes
2. extract the <script> content
3. extract the JSON from the JavaScript with a regex
4. decode the JSON
Note: the \; in the regex isn't really required, but the SO syntax highlighter gets confused without it.
A first rough solution without error checking. I left some debugging lines, commented out, in the code so that you can trace what each step is doing:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;
use JSON;
my $decoder = JSON->new;
my $tree = HTML::TreeBuilder->new_from_file(\*DATA);
#$tree->dump;
my @scripts = $tree->look_down(_tag => 'script');
#$scripts[0]->dump;
# NOTE 1: ->as_text() *DOES NOT* return <script> content!
# NOTE 2: ->as_HTML() probably doesn't work for all cases, i.e. escaping
my $javascript = ($scripts[0]->content_list())[0];
#print "${javascript}\n";
my($json) = $javascript =~ /(\{.+\})\;/s;
#print "${json}\n";
my $object = $decoder->decode($json);
print Dumper($object);
print "FOUND: units: ", $object->{data}->{units},
" horsepower: ", $object->{data}->{horsePower}, "\n";
# IMPORTANT: $tree needs to be destroyed by hand when you're done with it!
$tree->delete;
exit 0;
__DATA__
<html lang="en">
  <body>
    <script type="text/javascript">
      panel.web.bootstrapData = {
        "data": {
          "units": "kW",
          "horsePower": 100.00
        }
      };
    </script>
  </body>
</html>
Test run:
$ perl dummy.pl
$VAR1 = {
          'data' => {
                      'horsePower' => '100',
                      'units' => 'kW'
                    }
        };
FOUND: units: kW horsepower: 100
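Since the solution above skips error checking, here is a minimal sketch of how the same steps could be wrapped in a function that fails with a readable message. The name extract_bootstrap_data is hypothetical, and the regex assumes that the script assigns to something ending in bootstrapData and that the object literal is strict JSON (double-quoted keys, no trailing commas, no comments). If the page uses looser JavaScript syntax, the decode step will die.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use JSON;

# Hypothetical helper: takes an HTML string, returns the decoded
# bootstrap object as a hash reference, or dies with a reason.
sub extract_bootstrap_data {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $json;
    for my $script ($tree->look_down(_tag => 'script')) {
        # The script body is kept as plain text children of the element.
        my $javascript = join '', grep { !ref $_ } $script->content_list;
        # Assumption: the data is assigned as "...bootstrapData = { ... };"
        if ($javascript =~ /bootstrapData\s*=\s*(\{.*\})\s*;/s) {
            $json = $1;
            last;
        }
    }
    $tree->delete;

    die "no bootstrapData assignment found\n" unless defined $json;

    # decode() dies on malformed JSON; eval turns that into our own message.
    my $object = eval { JSON->new->decode($json) };
    die "could not decode JSON: $@" unless defined $object;
    return $object;
}

# Usage with the sample page from the question.
my $sample = <<'HTML';
<html lang="en">
<body>
<script type="text/javascript">
panel.web.bootstrapData = {
"data": {
"units": "kW",
"horsePower": 100.00
}
};
</script>
</body>
</html>
HTML

my $data = extract_bootstrap_data($sample);
print "units: $data->{data}{units}, horsePower: $data->{data}{horsePower}\n";
```

With the LWP code from the question, the string to pass in would be the response body, for example $reply->decoded_content, after checking $reply->is_success.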
Upvotes: 2