Reputation: 67
I have the following code:
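chdir("c:/perl/normalized") or die "Cannot chdir: $!";
$docid = 0;
my %hash = ();
# Open the output file once, not once per input line.
open(output, '>>c:/perl/tokens/total') or die "Cannot open output: $!";
@files = <*>;
foreach $file (@files)
{
    $docid++;
    open(input, $file) or die "Cannot open $file: $!";
    while (<input>)
    {
        chomp;
        (@words) = split(" ");
        foreach $word (@words)
        {
            # Record the document id once per occurrence of the word.
            push @{ $hash{$word} }, $docid;
        }
    }
    close(input);
}
foreach $key (sort keys %hash) {
    print output"$key : @{ $hash{$key} }\n";
}
close(output);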
This is a sample line from the output file:
of : 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 4 4 4 4 5 6 6 7 7 7 7 7 7 7 7 7
This is correct, since the term "of", for example, occurs ten times in the first document (hence the ten ones). However, is there a way to remove the repeated values? That is, instead of ten ones I want just one. Thank you for your help.
Upvotes: 2
Views: 5095
Reputation: 386676
To avoid adding the dups in the first place, change
foreach $word (@words)
to
foreach $word (uniq @words)
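Note that this only removes dups within a single line; a word that repeats on another line of the same file still gets its doc id pushed again. If you want to leave the dups in the data structure and remove them all on output, instead change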
print output"$key : @{ $hash{$key} }\n";
to
print output "$key : ", join(" ", uniq @{ $hash{$key} }), "\n";
uniq is provided by List::MoreUtils:
use List::MoreUtils qw( uniq );
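Or you can define it yourself (put the definition before the code that calls it, so the call parses without parentheses):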
sub uniq { my %seen; grep !$seen{$_}++, @_ }
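For illustration, here is a minimal sketch of deduplicating once per document instead of once per line, so each doc id is pushed at most once per word. It uses the hand-rolled uniq from above; build_index and the in-memory sample documents are hypothetical stand-ins for the file reading in the question, not part of the original code.

```perl
use strict;
use warnings;

# Hand-rolled uniq: keeps the first occurrence of each value, in order.
sub uniq { my %seen; grep { !$seen{$_}++ } @_ }

# Hypothetical helper: build word => [doc ids] from in-memory documents.
# Each argument is an array ref holding the lines of one document;
# doc ids are assigned in order, starting at 1.
sub build_index {
    my @docs = @_;
    my %index;
    my $docid = 0;
    for my $lines (@docs) {
        $docid++;
        # Gather every word of the document, then dedup once per document.
        my @words = map { split(' ', $_) } @$lines;
        push @{ $index{$_} }, $docid for uniq(@words);
    }
    return \%index;
}

my $index = build_index(
    [ "the cat of the house", "of mice" ],   # document 1, two lines
    [ "king of kings" ],                     # document 2
);
print "$_ : @{ $index->{$_} }\n" for sort keys %$index;
# The line for "of" is printed as "of : 1 2".
```

With the real files, the same idea would mean collecting the words of a whole file before calling uniq, rather than calling it on each line's @words.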
Upvotes: 5