user3666137

Reputation: 1

Perl HTML::Parser - search for a specific string in a parsed file

I'm new to HTML::Parser for Perl.

I'm trying to parse a web page and then search the result for a specific string, such as "pass" or "fail". How might I go about that?

Due to framework issues I have to use the base HTML::Parser library and not another module.

Snippet of code

#!/usr/bin/perl
use strict;

# define the subclass
package IdentityParse;

package HTMLStrip;
use base "HTML::Parser";

sub text {
  my ($self, $text) = @_;

  # just print out the original text
  print $text;
}

sub comment {
  my ($self, $comment) = @_;

  # print out original text with comment marker
  #print "hey hey";
}

sub end {
  my ($self, $tag, $origtext) = @_;

  # print out original text
  #print $origtext;
}

#my $p = new IdentityParse;
my $p    = new HTMLStrip;
my @file = $p->parse_file("testcase1.html");

if ($p->parse_file("testcase1.html") =~ "PASS") {
  print " The test passed \n";
}
else {
  print "\nthe test failed \n";
}

Upvotes: 0

Views: 216

Answers (1)

Borodin

Reputation: 126732

If all you want is to strip the tags from the markup, leaving just the text content, then you're making things too hard for yourself. All you need is a text handler subroutine that appends each text node to a file-scoped scalar.

It looks like this. I've collapsed each run of whitespace in the final string into a single space; otherwise the layout indentation leaves a lot of whitespace in the output.

use strict;
use warnings;

use HTML::Parser;

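# Call text() for every text node; 'dtext' passes the text with entities decoded
my $parser = HTML::Parser->new( text_h => [\&text, 'dtext'] );

# Accumulates the text content of the whole document
my $text_content;

sub text {
  $text_content .= shift;
}

$parser->parse_file(*DATA);
$text_content =~ s/\s+/ /g;   # collapse each run of whitespace into one space
print $text_content;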

__DATA__
<root>
  <item>
    Item 1
    status failed
  </item>
  <item>
    Item 2
    status passed
  </item>
  <item>
    Item 3
    status failed
  </item>
</root>

Output:

 Item 1 status failed Item 2 status passed Item 3 status failed  
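To get from the collected text to the pass/fail check the question asks about, one option is to run an ordinary regex match on the accumulated string once parsing has finished. The following is only a minimal sketch under some assumptions: `extract_text` is a hypothetical helper name, the inline `$html` string is a made-up stand-in for `testcase1.html`, and the word `PASS` is assumed to appear in the page's text content rather than in an attribute.

```perl
use strict;
use warnings;

use HTML::Parser;

# Hypothetical helper: return the text content of an HTML string.
sub extract_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the text with HTML entities decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;          # signal end of input so buffered text is flushed

    $text =~ s/\s+/ /g;    # collapse each run of whitespace into one space
    return $text;
}

# Made-up sample page standing in for testcase1.html
my $html = '<html><body><h1>Test case 1</h1><p>Result: <b>PASS</b></p></body></html>';

if ( extract_text($html) =~ /\bPASS\b/ ) {
    print "The test passed\n";
}
else {
    print "The test failed\n";
}
```

To read from a file, `$parser->parse_file("testcase1.html")` can replace the `parse`/`eof` pair. Note that `parse_file` returns the parser object (or undef if the file can't be opened), not the text, which is why matching its return value against "PASS" as in the question doesn't test the page content. The match above is case-sensitive; add the `/i` modifier if "pass" or "Pass" should count as well.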

Upvotes: 2
