Reputation: 3060
I am using a Perl script to remove all stop words from a text. The stop words are stored one per line. I am running it from the Mac OS X command line, and Perl is installed correctly.
The script does not work properly: it has a word-boundary problem.
#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;
# poor man's argument handler
open(WORDS, shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";
my @words;
# get all words into an array
while ($_=<WORDS>) {
chop; # strip eol
push @words, split; # break up words on line
}
# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
@words=sort { length($b) <=> length($a) } @words;
# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;
# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (@words) {
$text =~ s/\b\Q$word\E\s?//sg;
}
# output "fixed" text
print $text;
$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i
think id rather go up and above
$ cat stopwords.txt
I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..
$ ./remove.pl stopwords.txt sample.txt
i decide look fterwards cross do you think good idea go out d i
think id rather go up d bove
As you can see, the stop word a is stripped from the start of afterwards, leaving fterwards. I think it's a regex problem. Can somebody please help me patch this quickly? Thanks for all the help :J
Upvotes: 2
Views: 211
Reputation: 1475
Your regex is not strict enough.
$text =~ s/\b\Q$word\E\s?//sg;
When $word is a, the command is effectively s/\ba\s?//sg. This means: remove every a that starts a word, together with at most one following whitespace character. In afterwards, this matches the leading a, because nothing in the pattern requires the word to end there.
You can make the match stricter by ending the word with another \b, like this:
$text =~ s/\b\Q$word\E\b\s?//sg;
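To see the difference, here is a minimal sketch that runs both patterns on the same input. The helper strip_word and the sample sentence are made up for illustration; they are not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: remove $word from $text.
# Style 'open' uses only a leading \b (the original pattern);
# style 'closed' uses \b on both sides (the stricter pattern).
sub strip_word {
    my ($text, $word, $style) = @_;
    if ($style eq 'closed') {
        $text =~ s/\b\Q$word\E\b\s?//g;
    } else {
        $text =~ s/\b\Q$word\E\s?//g;
    }
    return $text;
}

my $sample = "it is a good idea afterwards";

print strip_word($sample, 'a', 'open'),   "\n";  # it is good idea fterwards
print strip_word($sample, 'a', 'closed'), "\n";  # it is good idea afterwards
```

The a inside idea is untouched in both cases, since there is no word boundary in front of it. Only the trailing \b stops the pattern from eating the first letter of afterwards.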
Upvotes: 0
Reputation: 80639
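Use a word boundary on both sides of your $word. Currently, you are only checking for one at the beginning.
With the closing \b in place, the \s? is no longer needed for a correct match (dropping it only means the space after each removed word stays in the text):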
$text =~ s/\b\Q$word\E\b//sg;
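If the leftover spaces matter, one possible variation (not what the original script does) is to join all stop words into a single alternation wrapped in \b on both sides, then squeeze the whitespace afterwards. The function name remove_stopwords, the sample data, and the case-insensitive /i flag are assumptions for this sketch; /i is there only because the stop word list contains I while the sample text uses i.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: strip every stop word from $text in one pass.
sub remove_stopwords {
    my ($text, @stopwords) = @_;

    # Escape each word and build \b(?:w1|w2|...)\b.
    my $alternation = join '|', map { quotemeta($_) } @stopwords;
    my $stop_re     = qr/\b(?:$alternation)\b/i;  # /i is an assumption

    $text =~ s/$stop_re//g;

    # Removed words leave their spaces behind: collapse runs of
    # spaces/tabs, then trim one leftover space at each line edge.
    $text =~ s/[ \t]+/ /g;
    $text =~ s/^ | $//mg;
    return $text;
}

# Assumed sample data; the real script reads both from files.
my @stopwords = qw(I a about an at how is it);
print remove_stopwords("how about i decide to look at it afterwards", @stopwords), "\n";
# decide to look afterwards
```

With \b on both sides, sorting the words by length is no longer required: if a shorter alternative such as a fails at the closing \b, the regex engine simply tries the next alternative.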
Upvotes: 1