Reputation: 79
I am trying to extract a multiline string using Perl, but I am only getting the number of matches. Here is a sample of what I am parsing:
<div id="content-ZAJ9E" class="content">
Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>
I am trying to get the content to be stored in a string using this code:
@a = ($html =~ m/class="content">.*<\/div>/gs);
print "array A, size: ", @a+0, ", elements: ";
print join (" ", @a);
print "\n";
but it returns the whole match, not just the text inside the div. Can someone point out the error in my regex?
Marisa
Upvotes: 1
Views: 1563
Reputation: 118158
Use something that is designed to parse HTML, such as HTML::TreeBuilder::XPath:
#!/usr/bin/env perl
use strict; use warnings;
use 5.014;
use HTML::TreeBuilder::XPath;
use YAML;
my $doc =<<EO_HTML;
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
Wow, I love the new top bar, so much easier to navigate now :)
Anywho, got a few other fixes I am working on as well. :)
I hope you all like the new look.
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML
my $tree= HTML::TreeBuilder::XPath->new;
$tree->store_comments(1);
$tree->parse($doc);
$tree->eof; # signal end of input so the tree is finalized
print Dump [ $tree->findvalues('//div[@class="content"]') ];
print Dump [ $tree->findvalues('//*[@class="extra"]') ];
print Dump [ $tree->findvalues('//comment()') ];
Notice that, by not relying on homebrewed regular expression patterns, you gain the ability to deal with variations in the input, such as comments and nested elements.
Output:
---
- ' Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. Here I am '
---
- Here I am
---
- ' begin <div> '
- ' end </div> '
Upvotes: 5
Reputation: 39158
Using a robust HTML parser:
use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div id="content-ZAJ9E" class="content">
Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>
HTML
The expression $w->find('div.content')->text returns:
Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
Upvotes: 7
Reputation: 54373
You are only matching the string; you are not capturing any part of it. If you want the text in the middle of the div, you need a capture group:
$html =~ m/class="content">(.*)<\/div>/gs;
my $text = $1;
print $text;
The captured text will be stored in the $1 variable. If there are multiple instances of such a div[class=content], you need a loop like this:
use strict; use warnings;
use Data::Dumper;
my $html = qq~<div id="content-ZAJ9E" class="content">
Wow, I love the new top bar.
</div>
<div id="content-ZAJ9E" class="content">
I still love it.
</div>
<div id="content-ZAJ9E" class="content">
I cant get enough!
</div>
~;
my @matches;
# *? makes it non-greedy so it will only match to the first </div>
while ($html =~ m/class="content">(.*?)<\/div>/gs){
my $group = $1;
$group =~ s/^\s+//; # strip whitespace at the beginning
$group =~ s/\s+$//; # and the end
push @matches, $group;
}
print Dumper \@matches;
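As an alternative to the while loop above, here is a minimal sketch that collects the captures in one step. It relies on the fact that, in list context, m//g with a capture group returns the captured substrings rather than the whole matches, which is also why the pattern in the question (no capture group) returned the entire matched text. The sample HTML and its ids are made up for illustration.

```perl
use strict;
use warnings;

# Sample input modelled on the question; the ids are hypothetical.
my $html = <<'HTML';
<div id="content-1" class="content">
Wow, I love the new top bar.
</div>
<div id="content-2" class="content">
I still love it.
</div>
HTML

# In list context, m//g with one capture group returns every captured
# substring, so no explicit loop is needed to collect them.
my @texts = $html =~ m/class="content">(.*?)<\/div>/gs;

# Trim leading and trailing whitespace from each captured string in place.
s/^\s+|\s+$//g for @texts;

print scalar(@texts), " matches\n";
print "$_\n" for @texts;
```

The loop form is still the better fit if you need to do more work per match; for anything beyond simple, well-known input, an HTML parser as suggested in the other answers is the safer choice.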
I suggest you take a look at perlre and perlretut.
Some notes:
Always use strict and use warnings!
Take a look at Data::Dumper, it's great for debugging your variables.
Upvotes: 4