Pink

Reputation: 161

Perl Hashes and regex

I am working on code that splits a sentence into individual words and then checks each word against the keys of a hash. My code finds terms that are identical to a key, and after a match I tag the word from the sentence with the value that corresponds to the matching key. The problem is that the code tags the terms with seemingly random values, not the ones I expect. Also, there are cases where the term and the hash key are similar but not identical. How can I write a regular expression to match such terms against the keys? Note: I have stemmed the hash keys to their root forms. For example, if the term from the sentence is Synergistic or anti-synergistic and my hash key is Synerg, how can I match the term with Synerg?

My code is as follows:

    open IN, "C:\\Users\\Desktop\\TM\\clean_cells.txt" or die "import file absent";
    my %hash=();
    use Tie::IxHash;
    tie %hash => "Tie::IxHash";
    while(<IN>)
    {
        chomp $_;
        $line=lc $_;
        @Organs=split/\t/, $line;
        $hash{$Organs[0]}=$Organs[1];
    }

    $Sentence="Lymphoma is Lymph Heart and Lung";
    @list=split/ /,$Sentence;

    @array=();
    foreach $term(@list)
    {
        chomp $term;
        for $keys(keys %hash)
        {
            if($hash{$term})
            {
                $cell="<$hash{$keys}>$term<\/$hash{$keys}>";
                push(@array, $cell);
            }
            elsif($term=~m/\b\Q$keys(\w+)\E\b/)
            {
                $cell="<$hash{$keys}>$term<\/$hash{$keys}>";
                push(@array, $cell);
            }
            elsif($term=~m/\b\Q(\w+)$keys\E\b/)
            {
                $cell="<$hash{$keys}>$term<\/$hash{$keys}>";
                push(@array, $cell);
            }
            elsif($term=~m/\b\Q(\w+)$keys(\w+)\E\b/)
            {
                $cell="<$hash{$keys}>$term<\/$hash{$keys}>";
                push(@array, $cell);
            }
        }
    }
    print @array;

For example, the hash looks like this:

    %hash = (
        'TF1'           => 'Lymph',
        'Thoracic_duct' => 'Lymph',
        'SK-MEL-1'      => 'Lymph',
        'Brain'         => 'Brain',
        'Cerebellum'    => 'Brain',
    );

So if the term TF1 is found, it should be replaced with <Lymph>TF1</Lymph>.

Upvotes: 0

Views: 267

Answers (1)

dan1111

Reputation: 6566

I found two big problems that were preventing your code from working:

  • You are making the keys to your hash lowercase, but you are not doing the same for the terms in $Sentence. Thus, uppercase words from $Sentence will never match.
  • The \Q...\E modifier disables regex meta-characters. While it is often good to do this when interpolating a variable, you cannot use expressions like (\w+) in there--that will look for the literal characters (\w+). Those regexes need to be rewritten like this: m/\b\Q$keys\E(\w+)\b/.
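Both points can be seen in a small standalone sketch. The key `synerg` and the three sample terms are assumptions taken from the example in the question, not from your data file; the term is lowercased before matching, and `\Q...\E` wraps only the interpolated key so that `(\w+)` still works as a pattern:

```perl
use strict;
use warnings;

my $key   = 'synerg';    # assumed: a stemmed, lowercased hash key
my @terms = ('Synergistic', 'anti-synergistic', 'Lymphoma');

my @matched;
for my $term (@terms) {
    my $lc_term = lc $term;    # compare in the same case as the keys

    # \Q...\E quotes only the key; (\w+) stays outside and remains a real group
    if ($lc_term =~ m/\b\Q$key\E(\w+)\b/) {
        push @matched, "$term (suffix: $1)";
    }
}
print "$_\n" for @matched;
```

With these inputs the two synerg-based terms match and Lymphoma does not. Had `(\w+)` been left inside `\Q...\E`, the regex would only match the literal text `(\w+)`.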

There are other design issues with your code, as well:

  1. You are using undeclared global variables all over the place. You should declare all variables with my. Always turn on use strict; use warnings;, which will force you to do this correctly.
  2. There doesn't appear to be any reason for Tie::IxHash, which causes your hash to be ordered. You don't use this ordering in any way in your code. The output is ordered by @list. I would do away with this unnecessary module.
  3. Your if/elsif statements are redundant. if($term=~m/\b(\w*)\Q$keys\E(\w*)\b/) will accomplish the same thing as all of them combined. Note that I replaced \w+ with \w*, which allows the groups before and after the key to match zero or more characters instead of one or more. The groups sit outside \Q...\E so that they are still treated as patterns.
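Putting these points together, here is a rough sketch of how the loop might look. It is not tested against your file: the in-memory hash and the sample sentence are hypothetical stand-ins for clean_cells.txt and your real input. It also tries an exact lookup with the term itself before falling back to a partial match, and stops at the first key that matches, which is an extra assumption on my part about what you want:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the data read from clean_cells.txt (already lowercased)
my %hash = (
    'tf1'        => 'lymph',
    'sk-mel-1'   => 'lymph',
    'brain'      => 'brain',
    'cerebellum' => 'brain',
);

my $sentence = 'TF1 Brains Heart and Lung';    # hypothetical sample sentence
my @array;

for my $term (split / /, $sentence) {
    my $lc_term = lc $term;
    my $tag;

    if (exists $hash{$lc_term}) {
        # Exact match: use the value stored under this term, not under some other key
        $tag = $hash{$lc_term};
    }
    else {
        # Partial match: a plain hash is unordered, so sort the keys for repeatable output
        for my $key (sort keys %hash) {
            if ($lc_term =~ m/\b(\w*)\Q$key\E(\w*)\b/) {
                $tag = $hash{$key};
                last;    # stop at the first matching key
            }
        }
    }
    push @array, "<$tag>$term</$tag>" if defined $tag;
}
print join(' ', @array), "\n";
```

Only terms that match a key are pushed, as in your original code. If several keys can match the same term, you will need to decide which one should win; sorting alphabetically is just a placeholder rule.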

Note: I didn't bother testing with Tie::IxHash, since I don't have that module and it appears unnecessary. It's possible that using this module is also introducing other problems in your code.

Upvotes: 1
