SilverShadow

Reputation: 185

Getting html content between specific <div> tag only

I wrote the following code to scrape the text content between <div id="aaa-bbb"> and the next </div> tag, but it just prints out the whole HTML source.

use LWP::Simple;

$url = 'http://domain.com/?xxxxxxx';

my $content = get($url);

$data =~ m/<div id="aaa-bbb">(.*?)<\/div>/g;

if (is_success(getprint($url))) {
    print $_;
 }

# or using the following line directly without if statement
print $data;

The HTML piece that I'm interested in looks like this:

<div id="aaa-bbb">
<p>text text text text text text text text text</p><p>text text text</p>
</div>

That specific div tag id appears only once in the whole HTML document.

I'd also like to strip out the <p></p> tags, or tidy the output with line breaks, so I can store it in a text file or reuse it later.

After reading your valuable comments, I tried using WWW::Mechanize and WWW::Mechanize::TreeBuilder instead, like this:

use strict;
use warnings;

use WWW::Mechanize; 
use WWW::Mechanize::TreeBuilder; 

my $mech = WWW::Mechanize->new; 
WWW::Mechanize::TreeBuilder->meta->apply($mech); 

$mech->get( 'domain.com/?xxxxxx' ); 

my @list = $mech->find('div id="aaa-bbb"'); # or <div id="aaa-bbb"> or "<div id="aaa-bbb">"
foreach (@list) { 
  print $_->as_text(); 
} 

It works for simple tags, but I can't get it to work with <div id="aaaa">. The script just exits without printing anything. I tried both double and single quotes, since the tag id already has double quotes inside it.

Upvotes: 0

Views: 2347

Answers (1)

Joel Berger

Reputation: 20280

This type of parsing is much easier with a DOM parser. My parser of choice is Mojo::DOM, which is part of the Mojolicious suite.

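#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;

# Fetch the page and parse the response body into a Mojo::DOM object
my $dom = $ua->get( 'domain.com/?xxxxxx' )->res->dom; 

# Select the element with id "aaa-bbb" and extract its text, including nested tags
my $text = $dom->at('#aaa-bbb')->all_text;

print "$text\n";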

The at method is a special case of the find method: find returns every matching element, while at returns only the first (or in your case, the only one). The # is the CSS selector syntax for ids.
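
As a rough illustration of the difference between at and find, and of one way to get the paragraphs out one per line without their <p> tags, here is a minimal sketch. It assumes the page has already been fetched, so it parses an inline sample string instead of a live URL; the sample markup (including the extra "other" div) and the variable names are made up for the example.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

# Inline stand-in for the fetched page (hypothetical sample markup)
my $html = <<'HTML';
<html><body>
<div id="other"><p>ignore me</p></div>
<div id="aaa-bbb">
<p>text text text text text text text text text</p><p>text text text</p>
</div>
</body></html>
HTML

my $dom = Mojo::DOM->new($html);

# at returns the first matching element, or undef when nothing matches
my $div = $dom->at('#aaa-bbb') or die "no element with id aaa-bbb\n";

# find returns a collection of every match; here, each <p> inside the div
my @paragraphs = $div->find('p')->map('text')->each;

# One paragraph per line, with the <p> tags gone
print "$_\n" for @paragraphs;
```

Checking the return value of at before calling a method on it avoids a "can't call method on undefined value" error when the id is missing from the page. Collecting the paragraphs into a list first also makes it easy to write them to a file later instead of printing them.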

Upvotes: 5
