Reputation: 79

How to parse multi-line HTML using regex in Perl

I am trying to parse out a multi-line string using Perl, but I am getting only the number of matches. Here is a sample of what I am parsing:

<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>

I am trying to get the content to be stored in a string using this code:

@a = ($html =~ m/class="content">.*<\/div>/gs);
print "array A, size: ",  @a+0,  ", elements: ";
print join (" ", @a);
print "\n";

but it returns the whole match, not just the text inside the div. Can someone point out the error in my regex?

Marisa

Upvotes: 1

Answers (3)

Sinan Ünür

Reputation: 118158

Use something that is designed to parse HTML, such as HTML::TreeBuilder::XPath:

#!/usr/bin/env perl

use strict; use warnings;
use 5.014;
use HTML::TreeBuilder::XPath;
use YAML;

my $doc =<<EO_HTML;
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
        Anywho, got a few other fixes I am working on as well. :)
        I hope you all like the new look.
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->store_comments(1);   # keep comments so //comment() can find them
$tree->parse($doc);
$tree->eof;                 # signal end of input so the tree is finalized

print Dump [ $tree->findvalues('//div[@class="content"]') ];
print Dump [ $tree->findvalues('//*[@class="extra"]') ];
print Dump [ $tree->findvalues('//comment()') ];

Notice that, because this does not rely on homebrewed regular expression patterns, it copes with variations in the input, such as comments that contain tags and elements nested inside the div.

Output:

---
- '  Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. Here I am '
---
- Here I am
---
- ' begin <div> '
- ' end </div> '
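
To illustrate the point about input variations, here is a minimal sketch, using core Perl only, of what a non-greedy regex does with a shortened version of the same sample. The pattern is an assumption borrowed from the kind of expression the question uses, not part of the parser-based solution above. It stops at the `</div>` inside the comment, so the span text is never captured:

```perl
use strict;
use warnings;

# Shortened version of the sample document above.
my $doc = <<'EO_HTML';
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

# Non-greedy capture: ends at the FIRST "</div>", which sits inside a comment.
my ($captured) = $doc =~ m/class="content">(.*?)<\/div>/s;

print "Captured: [$captured]\n";
print "Span text missing from the capture\n" if $captured !~ /Here I am/;
```

The capture ends with the opening of the second comment, which is exactly the kind of breakage a real HTML parser avoids.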

Upvotes: 5

daxim

Reputation: 39158

Using a robust HTML parser:

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>
HTML
$w->find('div.content')->text

This expression returns: Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.

Upvotes: 7

simbabque

Reputation: 54373

You are only matching the string; you are not capturing anything. If you want the text in the middle of the div, put a capture group around it:

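# Capture everything between the opening tag and the last </div>
if ($html =~ m/class="content">(.*)<\/div>/s) {
    my $text = $1;   # $1 is only meaningful after a successful match
    print $text;
}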

The captured text is stored in the $1 variable after a successful match. If there are several such div elements with class="content", you need a loop like this:

use strict; use warnings;
use Data::Dumper;

my $html = qq~<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar.
</div>
<div id="content-ZAJ9E" class="content">
        I still love it.
</div>
<div id="content-ZAJ9E" class="content">
        I cant get enough!
</div>
~;

my @matches;
# *? makes it non-greedy so it will only match to the first </div>
while ($html =~ m/class="content">(.*?)<\/div>/gs){ 
  my $group = $1;     
  $group =~ s/^\s+//; # strip whitespace at the beginning
  $group =~ s/\s+$//; # and the end

  push @matches, $group;
}
print Dumper \@matches;
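
As a variation on the loop above, a match with /g in list context returns the captured groups directly. The sketch below (the sample data and the names @whole and @inner are made up for illustration) contrasts a pattern without a capture group, which returns the entire matched text as in the question, with one that has a capture group:

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div class="content">
        Wow, I love the new top bar.
</div>
<div class="content">
        I still love it.
</div>
HTML

# No capture group: each element is the entire matched text, tags included.
my @whole = $html =~ m/class="content">.*?<\/div>/gs;

# One capture group: each element is only what the parentheses captured.
# The \s* on either side keeps surrounding whitespace out of the capture.
my @inner = $html =~ m/class="content">\s*(.*?)\s*<\/div>/gs;

print "whole: $_\n" for @whole;
print "inner: $_\n" for @inner;
```

This keeps the non-greedy `.*?` from the loop, so each match still ends at the nearest `</div>`; it is just as fragile as any other regex when the markup varies.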

I suggest you take a look at perlre and perlretut.

Some notes:

Always use strict and use warnings!
Try Data::Dumper; it's great for debugging your variables.
Using regex for HTML parsing is not the best idea. If you are doing a lot of parsing, consider one of the modules available on CPAN, such as HTML::Parser, HTML::TreeBuilder::XPath, HTML::TokeParser::Simple, or Mojo::DOM, or search for one on SO.

Upvotes: 4
