user1769925

Reputation: 598

Perl: Count number of times a word appears in text and print out surrounding words

I want to do two things:

1) count the number of times a given word appears in a text file

2) print out the context of that word

This is the code I am currently using:

my $word_delimiter = qr{
  [^[:alnum:][:space:]]*
  (?: [[:space:]]+ | -- | , | \. | \t | ^ )
  [^[:alnum:]]*
 }x;

my $word = "hello";
my $count = 0;

#
# here, a file's contents are loaded into $lines, code not shown
#

$lines =~ s/\R/ /g; # replace all line breaks with blanks (cannot just erase them, because this might connect words that should not be connected)
$lines =~ s/\s+/ /g; # replace all multiple whitespaces (incl. blanks, tabs, newlines) with single blanks
$lines = " ".$lines." "; # add a blank at beginning and end to ensure that first and last word can be found by regex pattern below

while ($lines =~ m/$word_delimiter$word$word_delimiter/g ) {
    ++$count;
    # here, I would like to print the word with some context around it (i.e. a few words before and after it)
}

Three problems:

1) Does my $word_delimiter pattern catch all the characters I can reasonably expect to separate words? Of course, I would not want to split hyphenated words, etc. [Note: I am using UTF-8 throughout, but only English and German text; and I understand that what reasonably separates words may be a matter of judgment.]

2) When the file to be analyzed contains text like "goodbye hello hello goodbye", the counter is incremented only once, because the regex matches only the first occurrence of " hello ". That match consumes the blank after the first "hello", so the second "hello" is no longer preceded by a delimiter. Any ideas on how to catch the second occurrence, too? Should I maybe somehow reset pos()?

3) How can I (reasonably efficiently) print out a few words before and after each matched word?

Thanks!

Upvotes: 2

Views: 775

Answers (1)

Zaid

Reputation: 37146

1. Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words?

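  • Word characters are denoted by the character class \w. It matches letters, digits and the underscore, including characters from non-Roman scripts.
  • \W is the negated class (non-word characters).
  • \b is a word boundary: it matches the position between a \w and a \W character (or the start or end of the string) and has zero length, so it consumes nothing.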

Using these already available character classes should suffice.
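To make the behaviour of these classes concrete, here is a minimal sketch. The sample strings are made up for illustration, and the lookaround variant is an assumed alternative for keeping hyphenated words intact, not something the answer above proposes. The ü is written as an escape so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

my $text = "goodbye hello hello goodbye";

# Assigning the match to an empty list forces list context,
# so $count receives the number of matches.
my $count = () = $text =~ /\bhello\b/g;
print "hello: $count\n";

# \b also sits between a letter and a hyphen,
# so "known" is found inside "well-known".
my $plain = () = "a well-known fact" =~ /\bknown\b/g;

# Assumed alternative: treat the hyphen as part of a word via lookarounds.
my $strict = () = "a well-known fact" =~ /(?<![\w-])known(?![\w-])/g;
print "plain: $plain, strict: $strict\n";

# With Unicode rules (/u, Perl 5.14+), \w covers German letters such as ü.
my $german = "\x{fc}ber" =~ /^\w+$/u ? 1 : 0;
print "umlaut is a word character: $german\n";
```

Whether the hyphen should count as part of a word is the judgment call mentioned in the question; the lookaround form is only one way to encode that choice.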

2. Any ideas on how to catch the second occurrence, too?

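Use zero-length word boundaries. Because \b consumes no characters, the blank between two adjacent occurrences is not used up by the first match and is still available when the regex looks for the second one.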

# \Q...\E quotes any regex metacharacters that $word might contain
while ( $lines =~ /\b\Q$word\E\b/g ) {
    ++$count;
}
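The third question (printing a few words around each hit) is not covered above. One possible approach, sketched here under assumptions, is to split the text into whitespace-separated tokens and print a slice around every token that contains the word. The helper name word_contexts and its $width parameter are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one context string per token containing $word.
sub word_contexts {
    my ($text, $word, $width) = @_;
    my @tokens = split ' ', $text;    # split on any run of whitespace
    my @contexts;
    for my $i (0 .. $#tokens) {
        # \b lets attached punctuation ("hello," or "(hello)") still match
        next unless $tokens[$i] =~ /\b\Q$word\E\b/;
        my $from = $i - $width < 0        ? 0        : $i - $width;
        my $to   = $i + $width > $#tokens ? $#tokens : $i + $width;
        push @contexts, join ' ', @tokens[$from .. $to];
    }
    return @contexts;
}

my @hits = word_contexts("goodbye hello hello goodbye", "hello", 1);
print scalar(@hits), " occurrence(s)\n";
print "  ... $_ ...\n" for @hits;
```

This makes a single pass over the tokens, so it should be reasonably efficient for ordinary text files. Note that it counts tokens rather than matches: a token such as "hello-hello" is reported once, and "hello-world" is reported as a hit because \b also matches next to a hyphen.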

Upvotes: 0
