pbu

Reputation: 3060

Perl removing words from file1 with file2

I am using a Perl script to remove all stop words from a text. The stop words are stored one per line. I am running it from the Mac OS X command line, and Perl is installed correctly.

The script does not work properly: it appears to have a word-boundary problem.

#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;

# poor man's argument handler
open(WORDS, shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";

my @words;
# get all words into an array
while ($_=<WORDS>) { 
  chop; # strip eol
  push @words, split; # break up words on line
}

# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
@words=sort { length($b) <=> length($a) } @words;

# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;

# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (@words) { 
     $text =~ s/\b\Q$word\E\s?//sg;
}

# output "fixed" text
print $text;

sample.txt

$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i 
think id rather go up and above

stopwords.txt

I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..

Output:

$ ./remove.pl stopwords.txt sample.txt 
i decide look fterwards cross do you think good idea go out d i 
think id rather go up d bove

As you can see, the stop word a turns afterwards into fterwards. I think it's a regex problem. Can somebody help me fix this? Thanks for all the help :J

Upvotes: 2

Views: 211

Answers (2)

Gowtham

Reputation: 1475

Your regex is not strict enough.

$text =~ s/\b\Q$word\E\s?//sg;

When $word is a, the command is effectively s/\ba\s?//sg. This means: remove every a that sits at the start of a word, together with at most one following whitespace character. Nothing requires the word to end after the a, so in afterwards the leading a matches and is removed.

You can make the match stricter by adding another \b after the word, like this:
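A minimal sketch of that behaviour, using a few words from the question's sample text as hypothetical test input. The variable names are made up for illustration:

```perl
use strict;
use warnings;

my $word = "a";   # the stop word that causes the damage

# Apply the question's original substitution to a few sample words.
my @samples = qw(afterwards about and);
my @damaged;
for my $sample (@samples) {
    (my $copy = $sample) =~ s/\b\Q$word\E\s?//sg;   # boundary on the left only
    push @damaged, $copy;
}

print "$samples[$_] -> $damaged[$_]\n" for 0 .. $#samples;
```

Only the leading a of each word is removed. The second a in afterwards is preceded by w, so there is no word boundary in front of it and it is left alone.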

$text =~ s/\b\Q$word\E\b\s?//sg;
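As a rough check of the stricter pattern, here is a small self-contained sketch. The helper name strip_words and the sample data are assumptions made for this example, not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: remove each stop word only where it stands alone
# as a whole word, along with at most one following whitespace character.
sub strip_words {
    my ($text, @words) = @_;
    for my $word (@words) {
        $text =~ s/\b\Q$word\E\b\s?//g;
    }
    return $text;
}

# Sample input assembled from words in the question's sample.txt.
my $text  = "look at it afterwards and go above";
my @words = qw(a about at it);

print strip_words($text, @words), "\n";
```

With the trailing \b, a and about no longer match inside afterwards, and, or above, while the standalone at and it are still removed.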

Upvotes: 0

hjpotter92

Reputation: 80639

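Use a word boundary (\b) on both sides of your $word. Currently, you are only checking for one at the beginning.

With \b in place on both sides, the \s? is no longer needed for the match to be correct. Note that without it, the space that followed each removed word stays in the text: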

$text =~ s/\b\Q$word\E\b//sg;
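To see what dropping \s? does to the spacing, here is a short sketch that runs both variants on one hypothetical sample line (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $text = "go out and about i think";   # sample line based on the question's text
my $word = "about";

# Variant without \s?: the word goes, the space after it stays.
(my $without_space = $text) =~ s/\b\Q$word\E\b//g;

# Variant with \s?: the word and one following whitespace character go.
(my $with_space = $text) =~ s/\b\Q$word\E\b\s?//g;

print "[$without_space]\n";
print "[$with_space]\n";
```

Both variants leave partial matches alone. The first one keeps a doubled space where the word used to be, so whether to keep \s? depends on whether the leftover whitespace matters for the output.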

Upvotes: 1
