koku

Reputation: 11

Regex to parse html for sentences?

I know that HTML::Parser exists, and from reading around I've realized that parsing HTML with regular expressions is usually a suboptimal approach. Still, for a Perl class I'm trying to use regular expressions (ideally just a single match) to identify and store the sentences from a saved HTML document. Eventually I want to calculate the number of sentences, the words per sentence, and ideally the average word length on the page.

For now, I've just tried to isolate things which follow ">" and precede a ". " just to see what if anything it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else or both. Any help would be appreciated!

#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;

open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;

print "<pre>";

###Main Program###
&sentences;

###sentence identifier sub###

sub sentences {
@sentences;
while ($html =~ />[^<]\. /gis) {
    push @sentences, $1;
}
#for debugging, comment out when running    
    print join("\n",@sentences);
}

print "</pre>";

Upvotes: 1

Views: 562

Answers (3)

mostly_perl_guy

Reputation: 26

I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Grabber;

# The HTML file name is the first command-line argument.
my $file_location = shift or die "usage: $0 file.html\n";
print "\n\nfile: $file_location";

my $total_word_count = 0;
my $sentence_count   = 0;
my $char_count       = 0;

open( my $file, '<', $file_location ) or die "cannot open $file_location: $!";
my $contents = do { local $/; <$file> };    # slurp the whole file
close($file);

my $dom = HTML::Grabber->new( html => $contents );

# Only text inside <p> elements is counted.
$dom->find('p')->each( sub {
    my $p_text = $_->text;

    # Each run of non-whitespace is one word; add up its characters as well.
    while ( $p_text =~ /(\S+)/g ) {
        $total_word_count++;
        $char_count += length $1;
    }

    # Each run of terminal punctuation ends one sentence.
    $sentence_count++ while $p_text =~ /[.!?]+/g;
} );

# Avoid dividing by zero when the page has no usable text.
die "\nno words or sentences found in <p> elements\n"
    unless $total_word_count && $sentence_count;

print "\n Total Words: $total_word_count\n";
print " Total Sentences: $sentence_count\n";
printf " Average words per sentence: %.2f\n\n", $total_word_count / $sentence_count;
print " Total Characters: $char_count\n";
printf " Average characters per word: %.2f\n\n", $char_count / $total_word_count;

Upvotes: 0

mirod

Reputation: 16171

A first improvement would be to write $html =~ />([^<.]+)\. /gs: you need to capture the match with parentheses, and to allow more than one character per sentence ;--)

This does not get all the sentences though, just the first one in each element.
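To see that limitation concretely, here is a minimal sketch that runs the improved single regex over a made-up HTML string (the sample text is an assumption for illustration, not the asker's file). Only the first sentence of each element is captured, because every later sentence is preceded by a space rather than a '>'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample input, used only for illustration.
my $html = '<p>First. Second. Third.</p><p>Fourth. Fifth.</p>';

my @sentences;
while ( $html =~ />([^<.]+)\. /gs ) {
    push @sentences, $1;    # text between '>' and the first ". "
}

# "Second", "Third" and "Fifth" are missed: none of them follows a '>'.
print join( "\n", @sentences ), "\n";
```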

A better way would be to capture all the text, then extract sentences from each fragment

while( $html=~ m{>([^<]*)<}g) { push @text_content, $1}; 
foreach (@text_content) { while( m{([^.]*)\.}gs) { push @sentences, $1; } }

(untested because it's early in the morning and coffee is calling)
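The two-step approach can be sketched as a self-contained script. This is only an illustration under a few assumptions: the helper name extract_sentences and the sample string are made up, and, unlike the snippet above, this variant keeps the terminating punctuation and also treats '!' and '?' as sentence endings. Text with no terminating punctuation is dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the sentences found in the text between tags.
sub extract_sentences {
    my ($html) = @_;

    # Step 1: collect every run of text that sits between '>' and '<'.
    my @text_content;
    while ( $html =~ m{>([^<]*)<}g ) {
        push @text_content, $1;
    }

    # Step 2: split each fragment into sentences ending in '.', '!' or '?'.
    my @sentences;
    for my $fragment (@text_content) {
        while ( $fragment =~ m{([^.!?]+[.!?]+)}g ) {
            my $sentence = $1;
            $sentence =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            push @sentences, $sentence if $sentence =~ /\w/;
        }
    }
    return @sentences;
}

# Made-up sample input, used only for illustration.
my $sample = '<html><body><p>First one. Second one!</p><p>Is this third?</p></body></html>';
my @found = extract_sentences($sample);

print scalar(@found), " sentences\n";
print "$_\n" for @found;
```

From the returned list, the sentence count is scalar(@found), and the words in each sentence can be counted with a split on whitespace.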

All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.

Upvotes: 2

Eli Algranti

Reputation: 9007

Your regex should be />[^<]*?\./gis

The *? means match zero or more characters, non-greedily. As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match all non-< characters up to the first period.
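A small sketch of the difference between the greedy and non-greedy quantifier, using a made-up sample string. The capture groups are an addition for illustration so that the matched text can be printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample input, used only for illustration.
my $html = '<p>One. Two. Three.</p>';

# Greedy: [^<]* grabs as much as it can, so the match runs to the LAST period.
my ($greedy) = $html =~ />([^<]*\.)/;

# Non-greedy: [^<]*? grabs as little as it can, so the match stops at the FIRST period.
my ($lazy) = $html =~ />([^<]*?\.)/;

print "greedy: $greedy\n";    # greedy: One. Two. Three.
print "lazy:   $lazy\n";      # lazy:   One.
```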

There may be other problems.

Now read this

Upvotes: 3
