Reputation: 1
I'm new to HTML::Parser for Perl.
I'm trying to parse a web page and then search for a specific string such as pass or fail. How might I go about that?
Due to framework issues I have to use the base HTML::Parser library and not another module.
Snippet of code:
#!/usr/bin/perl
use strict;
# define the subclass
package IdentityParse;
package HTMLStrip;
use base "HTML::Parser";
sub text {
    my ($self, $text) = @_;
    # just print out the original text
    print $text;
}
sub comment {
    my ($self, $comment) = @_;
    # print out original text with comment marker
    #print "hey hey";
}
sub end {
    my ($self, $tag, $origtext) = @_;
    # print out original text
    #print $origtext;
}
#my $p = new IdentityParse;
my $p = new HTMLStrip;
my @file = $p->parse_file("testcase1.html");
if ($p->parse_file("testcase1.html") =~ "PASS") {
    print " The test passed \n";
}
else {
    print "\nthe test failed \n";
}
Upvotes: 0
Views: 216
Reputation: 126732
If all you want is to strip the tags from the markup, leaving just the text content, then you're making things too hard for yourself. All you need is a text handler subroutine that appends each text node to a global scalar.
It looks like this. I've post-processed the final string to collapse every run of spaces and newlines into a single space; otherwise it contains a lot of whitespace left over from the layout indents.
use strict;
use warnings;
use HTML::Parser;
my $parser = HTML::Parser->new( text_h => [\&text, 'dtext'] );
my $text_content;
sub text {
    $text_content .= shift;
}
$parser->parse_file(*DATA);
$text_content =~ s/\s+/ /g;
print $text_content;
__DATA__
<root>
<item>
Item 1
status failed
</item>
<item>
Item 2
status passed
</item>
<item>
Item 3
status failed
</item>
</root>
Output:
Item 1 status failed Item 2 status passed Item 3 status failed
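To get from the collected text to the pass/fail check in the question, a regular expression match on that scalar should be enough. Note that parse_file returns the parser object rather than the page text, so matching its return value against "PASS" never inspects the content. The following is only a minimal sketch: the question does not show the real page, so the inline sample HTML, the literal words PASS and FAIL, and the $verdict variable are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Accumulate every decoded text node into one string.
my $text_content = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text_content .= shift }, 'dtext' ],
);

# Illustrative input (an assumption); for a real file, call
# $parser->parse_file("testcase1.html") instead of parse/eof.
$parser->parse('<html><body><p>Test case 1: <b>PASS</b></p></body></html>');
$parser->eof;

# Search the collected text, not the return value of the parse call.
my $verdict = $text_content =~ /\bPASS\b/ ? 'passed'
            : $text_content =~ /\bFAIL\b/ ? 'failed'
            :                               'unknown';
print "The test $verdict\n";
```

If the page uses lowercase or mixed-case wording, adding the /i modifier to both patterns would make the match case-insensitive.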
Upvotes: 2