Reputation: 67
I have read many forum posts on how to remove stop words from files. My code already removes many other things, but I also want it to remove stop words. This is as far as I have got, and I don't know what I am missing. Please advise.
use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');
chdir("c:/perl/input");
@files = <*>;
foreach $file (@files)
{
open (input, $file);
while (<input>)
{
open (output,">>c:/perl/normalized/".$file);
chomp;
#####What should I write here to remove the stop words#####
$_ =~s/<[^>]*>//g;
$_ =~ s/\s\.//g;
$_ =~ s/[[:punct:]]\.//g;
if($_ =~ m/(\w{4,})\./)
{
$_ =~ s/\.//g;
}
$_ =~ s/^\.//g;
$_ =~ s/,/' '/g;
$_ =~ s/\(||\)||\\||\/||-||\'//g;
print output "$_\n";
}
}
close (input);
close (output);
Upvotes: 0
Views: 921
Reputation: 42421
# Always use these in your Perl programs.
use strict;
use warnings;
use File::Basename qw(basename);
use Lingua::StopWords qw(getStopWords);
# It's often better to build scripts that take their input
# and output locations as command-line arguments rather than
# being hard-coded in the program.
my $input_dir = shift @ARGV;
my $output_dir = shift @ARGV;
my @input_files = glob "$input_dir/*";
# Convert the hash ref of stop words to a regular array.
# Also quote any regex characters in the stop words.
my @stop_words = map quotemeta, keys %{getStopWords('en')};
for my $infile (@input_files){
# Open both input and output files at the outset.
# Your posted code reopened the output file for each line of input.
my $fname = basename $infile;
my $outfile = "$output_dir/$fname";
open(my $fh_in, '<', $infile) or die "$!: $infile";
open(my $fh_out, '>', $outfile) or die "$!: $outfile";
# Process the data: you need to iterate over all stop words
# for each line of input.
while (my $line = <$fh_in>){
$line =~ s/\b$_\b//ig for @stop_words;
print $fh_out $line;
}
# Close the files within the processing loop, not outside of it.
close $fh_in;
close $fh_out;
}
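If looping over every stop word for each line turns out to be slow on large inputs, one possible variation is to join the words into a single alternation and compile it once. This is only a sketch: the word list below is a hypothetical stand-in for what getStopWords('en') returns, and strip_stop_words is a helper name invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample list; in the script above it would come from getStopWords('en').
my @words = qw(the is on a);

# Quote each word, join them into one alternation, and compile the regex once.
my $alternation = join '|', map { quotemeta } @words;
my $stop_re     = qr/\b(?:$alternation)\b/i;

# Remove every stop word from a line in a single substitution pass.
sub strip_stop_words {
    my ($line) = @_;
    $line =~ s/$stop_re//g;
    return $line;
}

print strip_stop_words("The cat is on a theme"), "\n";
```

The removed words leave their surrounding spaces behind, so the output still contains runs of whitespace that you may want to collapse afterwards.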
Upvotes: 0
Reputation: 52049
The stop words are the keys of %$stopwords that have the value 1, i.e.:
@stopwords = grep { $stopwords->{$_} } (keys %$stopwords);
It might happen to be true that the stop words are just the keys of %$stopwords, but according to the Lingua::StopWords docs you also need to check the value associated with the key.
Once you have the stop words, you can remove them with code like this:
# remove all occurrences of @stopwords from $_
for my $w (@stopwords) {
s/\b\Q$w\E\b//ig;
}
Note the use of \Q...\E to quote any regular expression metacharacters that might appear in the stop word. Even though it is very unlikely that stop words will contain metacharacters, this is good practice any time you want to match a literal string in a regular expression.
We also use \b to match a word boundary. This helps ensure that we won't remove a stop word that occurs in the middle of another word. Hopefully this will work for you, but it depends a lot on what your input text is like, e.g. whether it contains punctuation characters.
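To see the effect of \b and \Q...\E on a concrete string, here is a small self-contained sketch. The hash is a hypothetical stand-in for the hash ref returned by getStopWords('en'), remove_stop_words is a helper name made up for this example, and the whitespace cleanup at the end is an extra step that may or may not suit your data.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash ref returned by getStopWords('en').
my $stopwords = { the => 1, is => 1, on => 1, a => 1 };

# Keep only the keys whose value is true.
my @stopwords = grep { $stopwords->{$_} } keys %$stopwords;

# Remove every stop word from a string, then tidy the whitespace left behind.
sub remove_stop_words {
    my ($text) = @_;
    for my $w (@stopwords) {
        $text =~ s/\b\Q$w\E\b//ig;
    }
    $text =~ s/\s+/ /g;       # collapse runs of whitespace (assumed extra step)
    $text =~ s/^\s+|\s+$//g;  # trim both ends
    return $text;
}

# "theme" keeps its leading "the" because \b does not match inside a word.
print remove_stop_words("The cat is on a theme park bench"), "\n";
# prints "cat theme park bench"
```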
Upvotes: 2