Reputation: 39
I have two files as input: a file containing a list of stop words, StopWordsList.txt, and a file containing text, text1.txt. I want to remove from the text the words that are in StopWordsList.txt. Here is my code:
my $FichierResulat = '/home/lenovo/Bureau/MesTravaux/LeskAlgo/OriginalLeskResult';
open( my $FhResultat, '>:utf8', $FichierResulat );
open( my $fh1, "<:utf8", '/home/lenovo/Bureau/MesTravaux/LeskAlgo/DemoLesk/StopWordsList.txt' )
    or die "Failed to open file: $!\n"; #file contains stop words
open( my $fh2, "<:utf8", '/home/lenovo/Bureau/MesTravaux/LeskAlgo/text1.txt' ) #file contains text
    or die "Failed to open file: $!\n";
my @tabStopWords = <$fh1>;
my @tab_contexte;
my @words;
while ( <$fh2> ) {
    chomp;
    next if m/^$/;
    my $context = $_;
    @words = split( / /, $_ );
}
#compare: remove from @words the words existing in @tabStopWords
my %temp;
@temp{@tabStopWords} = 0 .. $#tabStopWords;
for my $val ( @words ) {
    if ( exists $temp{$val} ) {
        print "$val est présent dans tab1 à la position $temp{$val}.\n";
    }
    else {
        print "$val n'est pas dans tab1.\n";
        push @tab_sans_SW, $val;
    }
}
foreach my $value ( @tab_sans_SW ) {
    print $FhResultat "$value\n";
}
But the result file contains all the words in @words; the words that exist in @tabStopWords have not been removed. I hope you can help me.
My stop words file: ال الآن التي الذي الذين اللاتي اللائي اللتان اللتين
My text file: ومواصلات بما فيه من بريد ونور ومياه وصناعات وعلوم ومعارف وحينما يركب احدنا قطارا فإنه يركب في نفس الوقت على حرية جاهزة اعدها له آلاف العمال والمخترعين والمهندسين في
Upvotes: 1
Views: 85
Reputation: 124
We can get the difference using the smart match operator (~~):
my(@words_arr) = ("is","a");
my(@input_arr) = ("This","is","a","example","code");
my (@diff) = grep { not $_ ~~ @words_arr} @input_arr;
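Since the smart match operator has been flagged as experimental since Perl 5.18 and produces a warning there, the same difference can also be computed with a plain hash lookup. The following is only a sketch of that alternative, not the smart match approach itself; %exclude is a name introduced here for illustration, and the data is the same sample data as above.

```perl
use strict;
use warnings;

# Same illustrative data as in the smart match example.
my @words_arr = ("is", "a");
my @input_arr = ("This", "is", "a", "example", "code");

# Build a lookup hash once; every word to exclude becomes a key.
# (%exclude is a hypothetical name, not part of the original answer.)
my %exclude = map { $_ => 1 } @words_arr;

# Keep only the input words that are not keys of the lookup hash.
my @diff = grep { !exists $exclude{$_} } @input_arr;

print "@diff\n";    # prints "This example code"
```

The hash is built once, so each word is checked with a single lookup instead of a scan over the whole exclusion list.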
Upvotes: 0
Reputation: 126722
There are a couple of problems:
You don't chomp the contents of @tabStopWords, so each entry has a newline at the end.
You overwrite the contents of @words each time around the while loop with @words = split(/ /, $_) instead of adding to it.
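To make the second problem concrete, here is a minimal sketch that contrasts assigning to the array with pushing onto it. The in-memory @lines array and the names @overwritten and @accumulated are assumptions made for illustration; they stand in for the text file and for @words in the question.

```perl
use strict;
use warnings;

# Hypothetical in-memory lines standing in for the text file.
my @lines = ("alpha beta", "", "gamma delta");

my (@overwritten, @accumulated);
for my $line (@lines) {
    next if $line =~ /^$/;                   # skip empty lines, as in the question
    @overwritten = split(/ /, $line);        # replaces the previous contents
    push @accumulated, split(/ /, $line);    # appends to the previous contents
}

print "overwritten: @overwritten\n";    # prints "overwritten: gamma delta"
print "accumulated: @accumulated\n";    # prints "accumulated: alpha beta gamma delta"
```

With plain assignment only the words of the last non-empty line survive the loop, which is why the comparison in the question only ever sees part of the text.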
This program will do what you want. I have added use autodie to avoid having to check the result of every open, and I have removed a couple of unused variables. Local variable names are better written using just lower-case letters and underscores, especially for readers whose first language isn't English.
I've used split on both files to reduce them both to individual words. Because split also removes newline characters, there is no need for chomp.
use strict;
use warnings 'all';
use autodie;

use constant FICHIER_STOP_WORD => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/DemoLesk/StopWordsList.txt';
use constant FICHIER_TEXTE => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/text1.txt';
use constant FICHIER_RESULAT => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/OriginalLeskResult';

# Read every whitespace-separated word from the stop-word file
my @tab_stop_words = do {
    open my $fh1, "<:utf8", FICHIER_STOP_WORD;
    map { split } <$fh1>;
};

# Read every whitespace-separated word from the text file
my @words = do {
    open my $fh1, "<:utf8", FICHIER_TEXTE;
    map { split } <$fh1>;
};

# Map each stop word to its position in @tab_stop_words
my %stop_words = map { $tab_stop_words[$_] => $_ } 0 .. $#tab_stop_words;

open my $fh_resultat, '>:utf8', FICHIER_RESULAT;

for my $word ( @words ) {
    my $position = $stop_words{$word};
    if ( defined $position ) {
        print "$word est présent dans tab1 à la position $position.\n";
    }
    else {
        # Not a stop word, so keep it in the result file
        print "$word n'est pas dans tab1.\n";
        print $fh_resultat "$word\n";
    }
}
Upvotes: 2
Reputation: 69224
This problem would be easier to solve if you showed us the format of your two input files. But as you don't, this will be guesswork.
I guess that your file of stop words contains a single word on each line. In that case, each element in @tabStopWords and, therefore, each key in %temp will have a newline at the end. This makes it extremely unlikely that any of the words in your source file will match these keys.
You probably want to add:
chomp @tabStopWords;
to your code.
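The effect can be seen in a small sketch. It assumes, as guessed above, one stop word per line; the sample words and the names %before and %after are made up for illustration and are not taken from your files.

```perl
use strict;
use warnings;

# Hypothetical stop words as they would arrive from <$fh1>: one per line,
# each still carrying its trailing newline.
my @stop_words = ("the\n", "of\n");

# Keys built before chomp still end in "\n", so "the" is not found.
my %before = map { $_ => 1 } @stop_words;
my $found_before = exists $before{"the"} ? "yes" : "no";

chomp @stop_words;    # strips the trailing newline from every element

# Keys built after chomp match the bare word.
my %after = map { $_ => 1 } @stop_words;
my $found_after = exists $after{"the"} ? "yes" : "no";

print "before chomp: $found_before\n";    # prints "before chomp: no"
print "after chomp: $found_after\n";      # prints "after chomp: yes"
```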
Upvotes: 1