Reputation: 598
I want to do two things:
1) count the number of times a given word appears in a text file
2) print out the context of that word
This is the code I am currently using:
my $word_delimiter = qr{
    [^[:alnum:][:space:]]*
    (?: [[:space:]]+ | -- | , | \. | \t | ^ )
    [^[:alnum:]]*
}x;
my $word = "hello";
my $count = 0;
#
# here, a file's contents are loaded into $lines, code not shown
#
$lines =~ s/\R/ /g; # replace all line breaks with blanks (cannot just erase them, because this might connect words that should not be connected)
$lines =~ s/\s+/ /g; # replace all multiple whitespaces (incl. blanks, tabs, newlines) with single blanks
$lines = " ".$lines." "; # add a blank at beginning and end to ensure that first and last word can be found by regex pattern below
while ($lines =~ m/$word_delimiter$word$word_delimiter/g) {
    ++$count;
    # here, I would like to print the word with some context around it (i.e. a few words before and after it)
}
Three problems:
1) Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words? Of course, I would not want to separate hyphenated words, etc. [Note: I am using UTF-8 throughout but only English and German text; and I understand what reasonably separates a word might be a matter of judgment]
2) When the file to be analyzed contains text like "goodbye hello hello goodbye", the counter is incremented only once, because the regex only matches the first occurrence of " hello ". After all, the second time it could find "hello", it is not preceded by another whitespace, since the first match already consumed it. Any ideas on how to catch the second occurrence, too? Should I maybe somehow reset pos()?
3) How to (reasonably efficiently) print out a few words before and after any matched word?
Thanks!
Upvotes: 2
Views: 775
Reputation: 37146
1) Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words?
\w matches word characters (letters and the underscore). It also matches digits and characters from non-Roman scripts.
\W represents the negated sense (non-word characters).
\b represents a word boundary and has zero length.
Using these already available character classes should suffice.
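To illustrate how these classes behave on mixed English and German text, here is a minimal sketch. The sample string and variable names are made up for the example, and it assumes the text is held as decoded characters (here via `use utf8` on a literal; for a file you would typically open the handle with an `:encoding(UTF-8)` layer).

```perl
use strict;
use warnings;
use utf8;                              # the string literal below is UTF-8 source text

# Hypothetical sample text, not taken from the question
my $text = "Grüße, hello -- hello. Tschüß!";

# \w+ collects runs of word characters; on a decoded string this
# includes umlauts and ß, so German words are not split apart
my @words = $text =~ /(\w+)/g;

# \b is zero-width: it consumes nothing, so two occurrences
# separated by a single delimiter are both found
my $count = () = $text =~ /\bhello\b/g;

binmode STDOUT, ':encoding(UTF-8)';
print join('|', @words), "\n";
print "hello: $count\n";
```

One caveat: \b also sees a boundary at a hyphen, so "hello" would be found inside "hello-world". If hyphenated words must stay intact, a custom lookaround would be needed instead of \b.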
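Use zero-length word boundaries. Because \b consumes no characters, the blank between two adjacent occurrences is not used up by the first match, so the second one is found as well and there is no need to reset pos().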
# \Q...\E quotes any regex metacharacters that $word may contain
while ( $lines =~ /\b\Q$word\E\b/g ) {
    ++$count;
}
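For the third problem (printing a few words around each match), one possible approach is to split the text into whitespace-separated tokens and take an array slice around each hit. This is only a sketch under assumptions: the sample text, the `$context` width, and the `@snippets` array are hypothetical names introduced here, and a token that contains the word twice is counted once.

```perl
use strict;
use warnings;

# Hypothetical sample input; in the question this is the file's contents
my $lines   = "goodbye hello hello goodbye, and then hello again";
my $word    = "hello";
my $context = 2;                       # assumed: number of words shown on each side

# split ' ' splits on runs of whitespace and drops leading blanks,
# so punctuation stays attached to the displayed tokens
my @tokens = split ' ', $lines;
my $count  = 0;
my @snippets;

for my $i (0 .. $#tokens) {
    # \b lets tokens such as "hello," or "(hello)" match too
    next unless $tokens[$i] =~ /\b\Q$word\E\b/;
    ++$count;

    # clamp the window to the bounds of the token list
    my $from = $i - $context < 0        ? 0        : $i - $context;
    my $to   = $i + $context > $#tokens ? $#tokens : $i + $context;
    push @snippets, join ' ', @tokens[$from .. $to];
}

print "$count matches\n";
print "  ... $_ ...\n" for @snippets;
```

Splitting once and indexing into the token array keeps the work linear in the number of words, which should be efficient enough for ordinary text files.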
Upvotes: 0