gameover

Reputation: 12023

How to extract div tag

I'm trying to parse an HTML file, and I want to extract everything inside an outer div tag that has a unique id. Sample:

<body>
  ...
  <div id="1">

    <div id="2">
    ...
    </div>

    <div id="3">
    ...
    </div>

  </div>
  ...
</body>

Here I want to extract everything between <div id="1"> and its matching closing </div> tag, NOT the first </div> tag that follows it.

I've gone through many older posts, but their solutions don't work for me: they stop at the first </div> tag they see, which is not what I'm looking for.

Any pointers would be appreciated.

Upvotes: 2

Views: 510

Answers (2)

Kenosis

Reputation: 6204

Quentin has rightly mentioned using an HTML parser to extract div content. Here's one option using Mojo::DOM:

use strict;
use warnings;
use Mojo::DOM;

my $text = <<END;
<body>
  ...
  <div id="1">
Under div id 1
    <div id="2">
Under div id 2
    </div>

    <div id="3">
Under div id 3
    </div>

  </div>
Outside the divs
</body>
END

my $dom = Mojo::DOM->new($text);

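# Recent Mojolicious releases no longer provide pluck() on collections;
# map() is its replacement. Note that text() returns only the element's
# own text, not the text of nested elements.
print $dom->find('div[id="1"]')->map('text')->join("\n"), "\n";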

Output:

Under div id 1
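
The snippet above prints only the text directly under div id 1. If the goal is everything inside that div, nested divs included, something along these lines may be closer to what the question asks for. This is a minimal sketch, assuming a reasonably recent Mojolicious; `inner_html_by_id` is a hypothetical helper name, not part of Mojo::DOM:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper (not from the original answer): returns the inner
# markup of the div with the given id, or undef if no such div exists.
sub inner_html_by_id {
    my ($html, $id) = @_;
    # at() returns the first element matching the selector, or undef
    my $element = Mojo::DOM->new($html)->at(qq{div[id="$id"]});
    # content() renders everything between the opening and closing tags
    return defined $element ? $element->content : undef;
}

my $html = <<'END';
<body>
  <div id="1">
Under div id 1
    <div id="2">
Under div id 2
    </div>
    <div id="3">
Under div id 3
    </div>
  </div>
Outside the divs
</body>
END

my $inner = inner_html_by_id($html, 1);
print defined $inner ? $inner : "no matching div\n";
```

Because the parser tracks nesting, the result runs up to the closing tag that matches div id 1 rather than the first </div> it meets. If only the text is wanted, without the markup, `all_text` could be called on the element instead of `content`.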

Upvotes: 2

Quentin

Reputation: 943650

It sounds like your problem is that you are trying to parse HTML using regular expressions.

Don't. Use an HTML parser. There are plenty on CPAN. I'm fond of HTML::TreeBuilder::XPath.
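
For illustration, here is a rough sketch of how that might look with HTML::TreeBuilder::XPath. The helper name `div_html_by_id` and the sample markup are assumptions made for this example, not something taken from the question:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: returns the markup of the div with the given id
# (including the div's own tags), or undef if no such div exists.
sub div_html_by_id {
    my ($html, $id) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    # findnodes() returns the matching elements in list context
    my ($div) = $tree->findnodes(qq{//div[\@id="$id"]});
    my $result = defined $div ? $div->as_HTML : undef;
    $tree->delete;    # free the parse tree explicitly
    return $result;
}

my $html = <<'END';
<body>
  <div id="1">
    <div id="2">Under div id 2</div>
    <div id="3">Under div id 3</div>
  </div>
  <p>Outside the divs</p>
</body>
END

my $outer = div_html_by_id($html, 1);
print defined $outer ? "$outer\n" : "no matching div\n";
```

The XPath expression selects the element by its id, and the parser takes care of finding the matching closing tag, so nested divs come along with it. `as_text` could be used in place of `as_HTML` if only the text content is needed.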

Upvotes: 7
