Reputation: 873
I'm trying to parse an HTML file to count HTML tags. I'm not very familiar with regular expressions, though.
My current code counts only line by line, not tag by tag, and it prints the whole line.
while(<SUB>){
while(/(<[^\/][a-z].*>)/gi){
print $_;
$count++;
}
}
Suppose that we have a line like this in the file:
<div>blahblahblah</div><h1>hello</h1><p>blah</>
I need to extract the opening tag of every HTML element, and also standalone tags like <hr>, <br>, and <img>.
Could you please point me in the right direction?
Upvotes: 2
Views: 665
Reputation: 4868
If you want to count HTML tags within a document, I suggest that you use HTML::TreeBuilder.
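use strict;
use warnings;
use HTML::TreeBuilder;
use LWP::Simple;
my $url = "http://www.google.com";
my $content = get($url);
die "Could not fetch $url\n" unless defined $content;
# Build the parse tree; eof() tells the parser the document is complete.
my $tree = HTML::TreeBuilder->new();
$tree->parse($content);
$tree->eof();
# look_down returns every element that matches the given criteria.
my @div_tags = $tree->look_down( '_tag', 'div' );
my $size = @div_tags;
print "$size\n";
$tree->delete();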
Now you can specify a different tag name instead of div and count each kind of tag that you need. I suggest studying HTML::TreeBuilder, as it is a very useful module and you may find other methods in it that help.
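If you need a count for every tag at once rather than one tag name at a time, the same approach can be extended. What follows is only a minimal sketch: the sample string is hypothetical (loosely based on the line in the question), and it assumes that passing an always-true callback to look_down is an acceptable way to visit every element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, loosely based on the line in the question.
my $html = '<div>blahblahblah</div><h1>hello</h1><p>blah</p><hr><br><img src="x.png">';

# new_from_content parses the string and finalizes the tree in one call.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Tally every element by tag name; the always-true callback matches all elements.
my %count;
$count{ $_->tag }++ for $tree->look_down( sub { 1 } );

# Print one line per tag name, sorted alphabetically.
printf "%-6s %d\n", $_, $count{$_} for sort keys %count;

# Free the tree explicitly once the counts are collected.
$tree->delete;
```

Note that with its default settings HTML::TreeBuilder adds the implied html, head, and body elements, so those will show up in the tally even though they are not in the input. Skip them when printing if you only want tags that literally appear in the file.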
Upvotes: 2