Reputation: 39
I want to count, for each word in a list, the number of sentences that contain it. I have two files: one contains sentences and the other contains words. For each word, I want the number of sentences containing that word.
This is my code:
use strict;
use autodie;
open my $fh_resultat, ">:utf8", 'out';
use constant CORPUS_MOT => 'test';
use constant CORPUS_Phrases => 'phrases';
my @tab_MOT_CORPUS = do {
open my $fh1, "<:utf8", CORPUS_MOT;
map { split } <$fh1>;
};
my @tab_phrase_CORPUS = do {
open my $fh2, "<:utf8", CORPUS_Phrases;
map { split } <$fh2>;
};
foreach my $mot (@tab_MOT_CORPUS) {
my $nb_phrase = 0;
foreach my $ph (@tab_phrase_CORPUS) {
my @tab = split(/ /, $ph);
chomp @tab ;
#it should quit foreach if mot == val
foreach my $val(@tab) {
if ($mot eq $val) {
$nb_phrase = $nb_phrase + 1;
last;
}
}
}
print $fh_resultat "$mot:$nb_phrase\n";
}
print "$nbre_ligne\n";
For example, if I have these two sentences:
word1 is in sentence1 word1
word2 is in sentence2
the result should be:
word1:1
word2:1
Upvotes: 0
Views: 133
Reputation: 385657
The code expects @tab_phrase_CORPUS to contain lines, but it contains words.
my @tab_phrase_CORPUS = do {
open my $fh2, "<:utf8", CORPUS_Phrases;
map { split } <$fh2>;
};
should be
my @tab_phrase_CORPUS = do {
open my $fh2, "<:utf8", CORPUS_Phrases;
map { chomp; $_ } <$fh2>;
};
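To make the difference concrete, here is a minimal sketch that reads the same data both ways. It assumes the two sample sentences from the question and uses an in-memory filehandle in place of the real 'phrases' file; the sub names read_words and read_sentences are hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Assumed sample data standing in for the 'phrases' file.
my $data = "word1 is in sentence1 word1\nword2 is in sentence2\n";

# Hypothetical helper: splits every line into words, so sentence
# boundaries are lost (this is what the original code does).
sub read_words {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    return map { split } <$fh>;
}

# Hypothetical helper: keeps one element per line (sentence),
# with the trailing newline removed.
sub read_sentences {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    return map { chomp; $_ } <$fh>;
}

my @words     = read_words();
my @sentences = read_sentences();

print scalar(@words), " elements when splitting into words\n";
print scalar(@sentences), " elements when keeping whole lines\n";
```

With the word-based version, the inner loop of the original code sees one word at a time, so it counts occurrences instead of sentences.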
Tip: Remove chomp @tab; since the newlines have already been removed as you read from the file, which is the proper time to do it.
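Tip: my @tab = split(/ /, $ph); is better written as my @tab = split(' ', $ph); because the former splits on each individual space (producing empty fields for leading or repeated spaces), while the latter is a special case that splits on runs of whitespace and ignores leading whitespace.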
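As a rough illustration of the two tips together, the sketch below compares both forms of split on a string with irregular spacing, then counts sentences per word. The sample string and the helper name count_sentences_with are assumptions made for this example, not part of the original code.

```perl
use strict;
use warnings;

# Assumed sample with leading, doubled, and tab whitespace.
my $ph = "  word1  is in\tsentence1 ";

my @by_space      = split(/ /, $ph);   # splits on every single space
my @by_whitespace = split(' ', $ph);   # splits on runs of whitespace

print scalar(@by_space), " fields with split(/ /)\n";
print scalar(@by_whitespace), " fields with split(' ')\n";

# Hypothetical helper: number of sentences in which $mot appears
# at least once (repeated occurrences in one sentence count once).
sub count_sentences_with {
    my ($mot, @sentences) = @_;
    my $count = 0;
    for my $sentence (@sentences) {
        $count++ if grep { $_ eq $mot } split(' ', $sentence);
    }
    return $count;
}

my @sentences = ('word1 is in sentence1 word1', 'word2 is in sentence2');
print "$_:", count_sentences_with($_, @sentences), "\n" for qw(word1 word2);
```

Using grep in scalar context replaces the explicit inner loop with last: a sentence is counted once, however many times the word occurs in it.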
Upvotes: 4