Azaghal

Reputation: 430

Case insensitive search for multiple matches in a hash, merge the keys and add up the values

I have a hash containing a very large list of words (keys) and their frequencies (values).
My problem is that the same word can appear several times with different capitalization, like this:

de => 14477841
la => 6577441
et => 5327316
PAR => 1670264
PaR => 1669878
PAr => 1669877

When that happens, I would like to find all the different versions of the same word in the hash, regardless of case, and merge them while adding up their values, so here I'd get:

de => 14477841
la => 6577441
et => 5327316
par => 5010019

("par" is here in lower case but I don't really care, as long as there's only one version of it.)

I've tried putting the keys in an array and checking whether a differently cased version of each item exists in the hash. But there are many possible case patterns, and I can't think of or predict all of them.

Here is a sample of my code, for what it's worth (it partially works, but I still get duplicates):

my %hashoutput;
my %hash = map { my ( $key, $value ) = split "\t"; ( $key, $value ) } @lignes;

foreach $ligne (@lignes) #list of keys and values separated by a tab
{

($cleorigine, $valeur) = split /\t/, $ligne;    #get the key and value
$cle = $cleorigine =~ s/^([A-Z])/lc($1)/gr;     # different versions of it
$clemaj = $cleorigine =~ s/^([a-z])/uc($1)/ge;

    if ($cleorigine !~ /[0-9]{2}/g)
    {
        if ($ligne =~ /^([A-Z]|[ÉÈÊÂÀÙÛÇÔÎÏ])/g)
        {
            if (exists $hash{lc($cleorigine)})
            {
                $valeur1 = $valeur + $hash{lc($cleorigine)};    
                $hashoutput{ $cleorigine } = $valeur1;
            }
            if (not exists $hash{lc($cleorigine)})
            {
                if (exists $hash{$cle})
                {
                    $valeur2 = $valeur + $hash{$cle};
                    $hashoutput{ $cleorigine } = $valeur2;
                }
            }
        }
        elsif ($ligne =~ /^([a-z]|[éèêâàùûçôîï])/g)
        {

            if (exists $hash{$clemaj})
            {
            }
            elsif (not exists $hash{uc($clemaj)}) 
            {
                {
                    $hashoutput{ $cleorigine } = $valeur;
                }
            }

        }
    }
}

Is there a better or simpler way to do it?

Upvotes: 0

Views: 105

Answers (1)

Borodin

Reputation: 126732

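Create a new hash from the old one by aggregating the values for equivalent keys.

Like this. Note that each entry is deleted from the original hash as it is processed, so as to save space. The fc operator (enabled with use feature 'fc', available since Perl 5.16) does Unicode case folding, so it also works on non-ASCII characters as long as the keys are decoded character strings.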

use strict;
use warnings 'all';
use feature 'fc';

my %data = (
    de  => 14477841,
    la  =>  6577441,
    et  =>  5327316,
    PAR =>  1670264,
    PaR =>  1669878,
    PAr =>  1669877,
);

my %new_data;

$new_data{ fc $_ } += delete $data{$_} for keys %data;

use Data::Dump 'dd';
dd \%new_data;

output

{ de => 14477841, et => 5327316, la => 6577441, par => 5010019 }
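
If the data starts out as tab-separated lines, as with @lignes in the question, the intermediate hash can probably be skipped and the totals built while reading. This is only a minimal sketch: the sample lines below are made up from the question's data, %totals is a name chosen for illustration, and it assumes the lines have already been decoded into character strings (for example by reading the file through an :encoding(UTF-8) layer), since fc needs that to fold accented letters.

```perl
use strict;
use warnings;
use feature 'fc';

# Hypothetical input: "word<TAB>count" lines, standing in for the question's @lignes
my @lignes = (
    "de\t14477841",
    "la\t6577441",
    "et\t5327316",
    "PAR\t1670264",
    "PaR\t1669878",
    "PAr\t1669877",
);

my %totals;

for my $ligne (@lignes) {
    chomp $ligne;
    my ( $word, $count ) = split /\t/, $ligne, 2;
    next unless defined $count;    # skip lines without a tab-separated count

    # Case-fold the key so every spelling of a word lands in the same slot
    $totals{ fc $word } += $count;
}

# Print the merged counts, most frequent word first
printf "%s => %d\n", $_, $totals{$_}
    for sort { $totals{$b} <=> $totals{$a} } keys %totals;
```

On a Perl older than 5.16, lc could stand in for fc. It gives the same result for this sample, but it is not full case folding, so a few non-ASCII spellings may not be merged.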

Upvotes: 3
