Reputation: 12023
I'm trying to parse an HTML file and I want to extract everything inside an outer div tag with a unique id. Sample:
<body>
...
<div id="1">
<div id="2">
...
</div>
<div id="3">
...
</div>
</div>
...
</body>
Here I want to extract everything between <div id="1"> and its corresponding closing </div> tag, NOT the first </div> tag.
I've gone through many older posts, but their solutions don't work because they stop at the first </div> tag, which is not what I'm looking for.
Any pointers would be appreciated.
Upvotes: 2
Views: 510
Reputation: 6204
Quentin has rightly mentioned using an HTML parser to extract div
content. Here's one option using Mojo::DOM:
use strict;
use warnings;
use Mojo::DOM;
my $text = <<END;
<body>
...
<div id="1">
Under div id 1
<div id="2">
Under div id 2
</div>
<div id="3">
Under div id 3
</div>
</div>
Outside the divs
</body>
END
my $dom = Mojo::DOM->new($text);
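# find() returns a Mojo::Collection; map('text') calls text on each match
# (pluck is no longer available in current Mojolicious releases)
print $dom->find('div[id="1"]')->map('text')->join("\n");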
Output:
Under div id 1
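Note that text only returns the text sitting directly inside the matched element, not the nested divs. If the goal is the full inner markup of the outer div, a minimal sketch using content might look like this. It assumes a reasonably recent Mojolicious; $html, $outer and $inner are just illustrative names:

```perl
use strict;
use warnings;
use Mojo::DOM;

my $html = <<'END';
<body>
<div id="1">
Under div id 1
<div id="2">
Under div id 2
</div>
<div id="3">
Under div id 3
</div>
</div>
Outside the divs
</body>
END

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching the selector, or undef if none
my $outer = $dom->at('div[id="1"]');

# content() renders everything inside the element, nested tags included
my $inner = defined $outer ? $outer->content : '';
print $inner;
```

Because the parser tracks nesting, the result runs up to the closing tag that actually belongs to div 1, not the first </div> it encounters.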
Upvotes: 2
Reputation: 943650
It sounds like your problem is that you are trying to parse HTML using regular expressions.
Don't. Use an HTML parser. There are plenty on CPAN. I'm fond of HTML::TreeBuilder::XPath.
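For illustration, a rough sketch of how that could look with HTML::TreeBuilder::XPath, assuming the sample markup from the question (the variable names are made up, and text children are appended as-is without entity escaping):

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = <<'END';
<body>
<div id="1">
Under div id 1
<div id="2">
Under div id 2
</div>
<div id="3">
Under div id 3
</div>
</div>
Outside the divs
</body>
END

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# In list context findnodes returns the matching elements
my ($outer) = $tree->findnodes('//div[@id="1"]');

my $inner = '';
if ($outer) {
    for my $child ($outer->content_list) {
        # Children are either HTML::Element objects or plain text strings
        $inner .= ref $child ? $child->as_HTML : $child;
    }
}
print $inner, "\n";

# Free the tree explicitly once done with it
$tree->delete;
```

The XPath expression selects the div by its id, and the loop serialises each of its children, so both nested divs end up in the result.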
Upvotes: 7