Reputation: 430
I have a hash containing a very large list of words (keys) and their frequencies (values).
My problem is that the same word can appear several times with different capitalisation, like this:
de => 14477841
la => 6577441
et => 5327316
PAR => 1670264
PaR => 1669878
PAr => 1669877
When that happens, I would like to find all the versions of the same word in the hash, regardless of case, and merge them by adding up their values, so here I'd get:
de => 14477841
la => 6577441
et => 5327316
par => 5010019
("par" is here in lower case but I don't really care, as long as there's only one version of it.)
I've tried putting the keys in an array and checking whether a different version of each item in that list exists in the hash. But there are many possible case patterns, and I can't think of or predict them all.
Here is a sample of my code, for what it's worth (it partially works, but I still get duplicates):
my %hashoutput;
my %hash = map { my ( $key, $value ) = split "\t"; ( $key, $value ) } @lignes;
foreach $ligne (@lignes) #list of keys and values separated by a tab
{
    ($cleorigine, $valeur) = split /\t/, $ligne; #get the key and value
    $cle = $cleorigine =~ s/^([A-Z])/lc($1)/gr; # different versions of it
    $clemaj = $cleorigine =~ s/^([a-z])/uc($1)/ge;
    if ($cleorigine !~ /[0-9]{2}/g)
    {
        if ($ligne =~ /^([A-Z]|[ÉÈÊÂÀÙÛÇÔÎÏ])/g)
        {
            if (exists $hash{lc($cleorigine)})
            {
                $valeur1 = $valeur + $hash{lc($cleorigine)};
                $hashoutput{ $cleorigine } = $valeur1;
            }
            if (not exists $hash{lc($cleorigine)})
            {
                if (exists $hash{$cle})
                {
                    $valeur2 = $valeur + $hash{$cle};
                    $hashoutput{ $cleorigine } = $valeur2;
                }
            }
        }
        elsif ($ligne =~ /^([a-z]|[éèêâàùûçôîï])/g)
        {
            if (exists $hash{$clemaj})
            {
            }
            elsif (not exists $hash{uc($clemaj)})
            {
                {
                    $hashoutput{ $cleorigine } = $valeur;
                }
            }
        }
    }
}
Is there a better or simpler way to do it?
Upvotes: 0
Views: 105
Reputation: 126732
Create a new hash from the old one by aggregating the values for equivalent keys, like this.
Note that each entry is deleted from the original hash as it is copied, so as to save space. The fc operator (available since Perl 5.16) does Unicode case folding, so this also works on non-ASCII characters.
use strict;
use warnings 'all';
use feature 'fc';
my %data = (
    de  => 14477841,
    la  => 6577441,
    et  => 5327316,
    PAR => 1670264,
    PaR => 1669878,
    PAr => 1669877,
);
my %new_data;
$new_data{ fc $_ } += delete $data{$_} for keys %data;
use Data::Dump 'dd';
dd \%new_data;
{ de => 14477841, et => 5327316, la => 6577441, par => 5010019 }
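If the counts are still in the tab-separated lines from the question rather than already in a hash, the same idea can probably be applied while reading them, without building an intermediate hash. The following is only a sketch under some assumptions: the sample @lignes data and the sub name merge_case_variants are made up for illustration, and the words are assumed to be decoded character strings (for a real file, something like opening it with '<:encoding(UTF-8)'). The unicode_strings feature is enabled so that fc uses Unicode rules even for accented characters below U+0100.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);    # fc needs Perl 5.16 or later

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample input standing in for @lignes from the question:
# one "word<TAB>count" pair per line. "\x{C9}" is a capital E with acute.
my @lignes = (
    "de\t14477841",
    "PAR\t1670264",
    "PaR\t1669878",
    "PAr\t1669877",
    "\x{C9}tat\t10",
    "\x{C9}TAT\t5",
);

# Sum the counts of all spellings that share the same case-folded form.
# (merge_case_variants is an illustrative name, not an existing function.)
sub merge_case_variants {
    my @lines = @_;
    my %merged;
    for my $line (@lines) {
        chomp $line;
        my ( $word, $count ) = split /\t/, $line;
        next unless defined $count;    # skip lines that have no tab
        $merged{ fc $word } += $count;
    }
    return %merged;
}

my %merged = merge_case_variants(@lignes);
printf "%s => %d\n", $_, $merged{$_} for sort keys %merged;
```

The keys of the result are the case-folded spellings, so the original capitalisation is not kept. If one particular spelling should survive, it would have to be recorded separately while merging.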
Upvotes: 3