Reputation: 185
I wrote the following code to scrape the text content between <div id=aaa-bbb> and the next </div> tag, but it only prints out the whole HTML source.
use LWP::Simple;
$url = 'http://domain.com/?xxxxxxx';
my $content = get($url);
$data =~ m/<div id="aaa-bbb">(.*?)<\/div>/g;
if (is_success(getprint($url))) {
print $_;
}
# or using the following line directly without if statement
print $data;
The HTML piece that I'm interested in looks like this:
<div id="aaa-bbb">
<p>text text text text text text text text text</p><p>text text text</p>
</div>
That specific div tag id appears only once in the whole HTML document.
I'm also looking to strip out the <p></p> tags, or tidy the output with line breaks, so that I can store it as a text file or reuse it later.
After reading your valuable comments I tried using WWW::Mechanize and WWW::Mechanize::TreeBuilder instead, like this:
use strict;
use warnings;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;
my $mech = WWW::Mechanize->new;
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get( 'domain.com/?xxxxxx' );
my @list = $mech->find('div id="aaa-bbb"'); # or <div id="aaa-bbb"> or "<div id="aaa-bbb">"
foreach (@list) {
print $_->as_text();
}
It works for simple tags, but I can't get it to work with <div id="aaaa">. The script just exits without printing anything. I tried both double and single quotes, since the tag id already has double quotes inside it.
Upvotes: 0
Views: 2347
Reputation: 20280
This type of parsing is much easier with a DOM parser. My parser of choice is Mojo::DOM which is part of the Mojolicious suite.
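#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
# Fetch the page and parse the response body into a Mojo::DOM tree
my $dom = $ua->get( 'http://domain.com/?xxxxxx' )->res->dom;
# Select the element with id "aaa-bbb" and extract all the text inside it
my $text = $dom->at('#aaa-bbb')->all_text;
print "$text\n";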
The at method is a special case of the find method: find returns all the matching instances, while at returns only the first (or, in your case, the only one). The # is the CSS selector syntax for ids.
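If you also want one paragraph per line rather than a single run of text, something along these lines might work. This is only a sketch: the inline HTML is a made-up stand-in for your page (so it runs without a network request), and the variable names are my own. It uses at to pick the div, then find to collect the <p> elements inside it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup, modelled on the fragment in the question.
my $html = <<'HTML';
<html><body>
<div id="other"><p>ignore me</p></div>
<div id="aaa-bbb">
<p>first paragraph</p><p>second paragraph</p>
</div>
</body></html>
HTML

my $dom = Mojo::DOM->new($html);

# at() returns the first match, or undef when nothing matches.
my $div = $dom->at('#aaa-bbb')
    or die "no element with id aaa-bbb\n";

# find() returns a Mojo::Collection of every match inside the div;
# map('all_text') turns each <p> element into its text content.
my @paragraphs = $div->find('p')->map('all_text')->each;

# One paragraph per line, ready to print or write to a file.
my $output = join "\n", @paragraphs;
print "$output\n";
```

Checking the return value of at before calling a method on it avoids a "Can't call method on an undefined value" error if the id is ever missing from the page.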
Upvotes: 5