Reputation: 650
I need some suggestions on parsing HTML content. I need to extract the id of each <a> tag inside a div and store it in a variable. I tried writing a regular expression for this, but it picks up the ids of the <a> tags in every div. I need to store only the ids of the <a> tags that are inside one specific div.
The HTML content is:
<div class="m_categories" id="part_one">
<ul>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10018">aaa</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10007">bbb</a>
</li>
.
.
.
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10016">ccc</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10011">ddd</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10025">eee</a>
</li>
.
.
</div>
Any suggestions would be appreciated. Thanks in advance.
Update: the regex I have used:
if($content=~m/sel_cat " id="([^<]*?)"/is){}
while($content=~m/sel_cat " id="([^<]*?)"/igs){}
Upvotes: 0
Views: 164
Reputation: 57640
There are many great HTML parsers around. I like the Mojo suite, which lets me use CSS selectors to get a part of the DOM:
use feature 'say';
use Mojo::DOM;
my $dom = Mojo::DOM->new($html_content);
say for $dom->find('a.sel_cat')->map('all_text')->each;
# Or, more robust:
# say $_->all_text for $dom->find('a.sel_cat')->each;
Output:
aaa
bbb
ccc
ddd
eee
Or for the IDs:
say for $dom->find('a.sel_cat')->map(attr => 'id')->each;
# Or, more robust:
# say $_->attr('id') for $dom->find('a.sel_cat')->each;
Output:
sel_cat_10018
sel_cat_10007
sel_cat_10016
sel_cat_10011
sel_cat_10025
If you only want the ids inside the part_two div, use the selector #part_two a.sel_cat.
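As a minimal sketch of that scoped selector (assuming Mojolicious is installed; the sample markup is a trimmed copy of the question's HTML with the <ul> tags closed, and @part_two_ids is just an illustrative variable name):

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample markup modelled on the question (trimmed for brevity).
my $html_content = <<'HTML';
<div class="m_categories" id="part_one">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10018">aaa</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10007">bbb</a></li>
</ul>
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10016">ccc</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10011">ddd</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10025">eee</a></li>
</ul>
</div>
HTML

my $dom = Mojo::DOM->new($html_content);

# Keep only the <a class="sel_cat"> elements that are descendants of #part_two,
# then pull the id attribute out of each one.
my @part_two_ids = map { $_->attr('id') } $dom->find('#part_two a.sel_cat')->each;

say for @part_two_ids;
```

With the sample above this should print sel_cat_10016, sel_cat_10011 and sel_cat_10025, one per line. Swapping #part_two for #part_one in the selector would collect the ids from the first div instead.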
Upvotes: 1
Reputation: 3709
You should really look into HTML::Parser rather than trying to use a regex to extract bits of HTML.
One way to use it to extract the id attribute from each div tag would be:
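use HTML::Parser;

# This parser only looks at opening tags
sub start_handler {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq 'div') {      # is it a div element?
        if ($attr->{id}) {        # does the div have an id?
            print "div id found: ", $attr->{id}, "\n";
        }
    }
}

my $html = read_html_somehow() or die $!;   # placeholder: load the HTML however you like
my $p = HTML::Parser->new(api_version => 3);
# The argspec string lists what gets passed to the handler, in order
$p->handler(start => \&start_handler, 'self, tagname, attr, attrseq, text');
$p->parse($html);
$p->eof;    # signal the end of the document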
This is a lot more robust and flexible than a regex-based approach.
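To get the ids of the <a> tags inside one specific div, as the question asks, the same idea can be extended with an end handler that tracks whether the parser is currently inside that div. The following is only a rough sketch under some assumptions: the names $target_div, $depth, on_start and on_end are made up for illustration, the sample markup is a trimmed copy of the question's HTML, and it assumes the div tags in the input are properly balanced.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample markup modelled on the question (trimmed for brevity).
my $html = <<'HTML';
<div class="m_categories" id="part_one">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10018">aaa</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10007">bbb</a></li>
</ul>
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10016">ccc</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10011">ddd</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10025">eee</a></li>
</ul>
</div>
HTML

my $target_div = 'part_two';   # hypothetical: id of the div we care about
my $depth      = 0;            # greater than 0 while inside the target div
my @ids;

sub on_start {
    my ($tagname, $attr) = @_;
    if ($tagname eq 'div') {
        if ($depth) {
            $depth++;          # a nested div inside the target div
        }
        elsif (($attr->{id} // '') eq $target_div) {
            $depth = 1;        # entering the target div
        }
    }
    elsif ($tagname eq 'a' && $depth && defined $attr->{id}) {
        push @ids, $attr->{id};
    }
}

sub on_end {
    my ($tagname) = @_;
    # Leaving a div while inside the target: step one level back out.
    $depth-- if $tagname eq 'div' && $depth;
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&on_start, 'tagname, attr'],
    end_h       => [\&on_end,   'tagname'],
);
$p->parse($html);
$p->eof;

print "$_\n" for @ids;
```

With the sample above, @ids should end up holding sel_cat_10016, sel_cat_10011 and sel_cat_10025. The depth counter is there so that a nested div closing inside the target does not end the collection early.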
Upvotes: 2