Selecting rows from a table based upon certain criteria

Question

I have a table with five tab-delimited columns, shown at the end. Each "Name1" has one or more "Name2" entries, and each "Name2" has a "Length", a "Score", and one or more "Target(s)".
I need to select exactly one "Name2" for each "Name1", based on the following criteria.

(1) If there is only one "Name2" for a "Name1", print "Name1", "Name2", "Length", and "Score".
(2) If there are two or more "Name2" entries for the same "Name1", compare their targets. If the set of targets is the same for all of them, print the "Name2" with the highest "Score". If the scores are equal, print the "Name2" with the greatest "Length". If both "Score" and "Length" are equal, print the first "Name2" along with "Name1", "Score", and "Length".

I have written a program that loads all of the table information into hashes, but I cannot work out how to compare the targets of different "Name2" entries under the same "Name1". I would appreciate it if anyone could help me complete the program. Thanks a lot. The program was executed as: ./Program.pl Inputfile.txt

(A) Input file

Name1    Name2    Length    Score    Target    
sjj2_2RSSE.1    sjj2_2RSSE.1#II_15269285_15270181    897    3    WBGene00007064
sjj2_2RSSE.1    sjj2_2RSSE.1#II_15269295_15270191    897    4    WBGene00007064
sjj2_AC3.1    sjj2_AC3.1#V_10368996_10369727    732    3    WBGene00005532
sjj2_AC3.2    sjj2_AC3.2#V_10373256_10373988    733    3    WBGene00007070
sjj2_AC3.2    sjj2_AC3.2#V_10373256_10373988    733    3    WBGene00007028
sjj2_AC3.2    sjj2_AC3.2#V_10373256_10373988    733    3    WBGene00007019
sjj2_AC3.2    sjj2_AC3.2#V_10373256_10376356    3101    2    WBGene00007070
sjj2_AC3.2    sjj2_AC3.2#V_10373256_10376356    3101    2    WBGene00007028
sjj2_AC3.2    sjj2_AC3.2#V_10373256_10376356    3101    2    WBGene00007019
sjj2_AC3.6    sjj2_AC3.6#V_10393744_10394300    557    3    WBGene00000724
sjj2_AH10.1    sjj2_AH10.1#V_14146901_14148094    1194    4    WBGene00007082
sjj2_AH6.10    sjj2_AH6.10#II_9548665_9549674    1010    3    WBGene00003177
sjj2_AH6.10    sjj2_AH6.10#II_9548675_9549684    1010    2    WBGene00003177

(B) Expected Output

Name1     Name2    Length    Score
sjj2_2RSSE.1    sjj2_2RSSE.1#II_15269295_15270191    897    4
sjj2_AC3.1    sjj2_AC3.1#V_10368996_10369727        732    3
sjj2_AC3.2    sjj2_AC3.2#V_10373256_10373988        733    3
sjj2_AC3.6    sjj2_AC3.6#V_10393744_10394300        557    3
sjj2_AH10.1    sjj2_AH10.1#V_14146901_14148094    1194    4
sjj2_AH6.10    sjj2_AH6.10#II_9548665_9549674    1010    3

Program.pl

#!/usr/bin/perl
use Data::Dumper;


%data=();
@arraysiRNA=();
$i=0;
$file1=$ARGV[0];
open(FP1, $file1);
while($siRNA=<FP1>)
{
    chomp($siRNA);

    @aa=split(/\t/,$siRNA);
    ($clone_id,$amplicon_id,$amplicon_length,$amplicon_evidence,$amplicon_target)=split /\t/,$siRNA; 

    if(exists $data{$clone_id})
    {


        $data{$clone_id}{$amplicon_id}{amplicon_length}=$amplicon_length;
        $data{$clone_id}{$amplicon_id}{amplicon_evidence}=$amplicon_evidence;
        push( @{ $data{$clone_id}{$amplicon_id}{amplicon_target}}, $amplicon_target);

    }
    else
    {
    $data{$clone_id}{$amplicon_id}{amplicon_length}=$amplicon_length;
    $data{$clone_id}{$amplicon_id}{amplicon_evidence}=$amplicon_evidence;
    push( @{ $data{$clone_id}{$amplicon_id}{amplicon_target}}, $amplicon_target);
    }

    $i++;
}


#print Dumper(\%data);  


foreach $Name1 (keys %data)
{
    foreach $Name2 (keys %{$data{$Name1}})
    {
    $len=$data{$Name1}{$Name2}{amplicon_length};
    $evid=$data{$Name1}{$Name2}{amplicon_evidence};
    @tar=@{$data{$Name1}{$Name2}{amplicon_target}};
        #select only unique targets
    @uniqueTar = do { my %seen; grep { !$seen{$_}++ } @tar };
    print "$Name1\t$Name2\tamplicon_length= $len\n";
    print "amplicon_evidence= $evid\n";
    print "amplicon_target= @tar\n";
    print "amplicon_target uniq= @uniqueTar\n";

    }
}



close FP1;

Todd · Accepted Answer

The trick here is to be able to quickly know how many keys or array elements there are in a nested hash. I use this pattern frequently.

scalar keys %{ $hash{$key} }

and

scalar @{ $hash{$key} }

Each returns the number of elements. So if you want to see if a hash key has only 1 subkey:

if (scalar keys %{ $hash{$key} } == 1) {

Using this pattern, you can now check for the cases you've defined.
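As a minimal sketch of the counting pattern, assuming a hash shaped like %data in the question (clone, then amplicon, then an array of targets). The helper names amplicon_count and target_count are hypothetical, and the hash literal is hard-coded from two clones in the sample table purely for illustration.

```perl
use strict;
use warnings;

# Illustrative data shaped like %data in the question:
# clone => amplicon => { amplicon_target => [ ... ] }
my %data = (
    'sjj2_AC3.1' => {
        'sjj2_AC3.1#V_10368996_10369727' =>
            { amplicon_target => ['WBGene00005532'] },
    },
    'sjj2_AC3.2' => {
        'sjj2_AC3.2#V_10373256_10373988' =>
            { amplicon_target => ['WBGene00007070', 'WBGene00007028', 'WBGene00007019'] },
        'sjj2_AC3.2#V_10373256_10376356' =>
            { amplicon_target => ['WBGene00007070', 'WBGene00007028', 'WBGene00007019'] },
    },
);

# Hypothetical helper: number of amplicons (subkeys) under one clone.
sub amplicon_count {
    my ($data, $clone) = @_;
    return scalar keys %{ $data->{$clone} };
}

# Hypothetical helper: number of targets stored for one amplicon.
sub target_count {
    my ($data, $clone, $amplicon) = @_;
    return scalar @{ $data->{$clone}{$amplicon}{amplicon_target} };
}

for my $clone (sort keys %data) {
    printf "%s has %d amplicon(s)\n", $clone, amplicon_count(\%data, $clone);
}
```

Here scalar forces the count rather than the list of keys or elements, which is what makes the == 1 comparison work.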

You defined two cases:

clone has 1 amplicon
clone has multiple amplicons, all amplicons have same target

For case 2, add the following at lines 24 and 31 of your program (right after the push in each branch of the if/else):

$data2{$clone_id}{'amplicons'}{$amplicon_target} = 1;

Now just add another loop to analyze (this replaces your bottom loop, after the Dump).

foreach my $clone (keys %data) {

    # case 1, clone only has 1 amplicon
    if (scalar keys %{$data{$clone}} == 1) {


    } else {

        # case 2, clone has >1 amplicon but all have same target
        if (scalar keys %{$data2{$clone}{'amplicons'}} == 1) {

        }

    }
}
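
The loop bodies above are left empty, and counting distinct targets per clone only catches clones with a single target. A clone such as sjj2_AC3.2, where both amplicons share the same three targets, needs a set comparison. The following is one possible sketch, not the only way to do it: it builds a sorted, comma-joined signature of each amplicon's targets and treats the sets as equal when every amplicon yields the same signature. The sub name select_amplicons and the field names are hypothetical. The question does not say what to do when the target sets differ, so this sketch assumes such clones are skipped. It also assumes the header line starts with "Name1".

```perl
use strict;
use warnings;

# Reads tab-delimited rows (Name1, Name2, Length, Score, Target) from a
# filehandle and returns one tab-joined output line per Name1, in input order.
sub select_amplicons {
    my ($fh) = @_;
    my (%data, @clone_order);
    my $seq = 0;    # running counter, used to remember which amplicon came first

    while (my $line = <$fh>) {
        $line =~ s/\s+$//;                                # drop newline and trailing blanks
        next if $line eq '' || $line =~ /^Name1\b/;       # assumption: header starts with Name1
        my ($clone, $amplicon, $length, $score, $target) = split /\t/, $line;

        push @clone_order, $clone unless exists $data{$clone};
        if (!exists $data{$clone}{$amplicon}) {
            $data{$clone}{$amplicon} = {
                name    => $amplicon,
                length  => $length,
                score   => $score,
                order   => $seq++,
                targets => {},
            };
        }
        $data{$clone}{$amplicon}{targets}{$target} = 1;   # hash keys give unique targets
    }

    my @out;
    for my $clone (@clone_order) {
        my @amplicons = values %{ $data{$clone} };

        # One signature per distinct target set; a single amplicon trivially has one.
        my %signatures;
        $signatures{ join ',', sort keys %{ $_->{targets} } } = 1 for @amplicons;
        next if scalar(keys %signatures) > 1;   # assumption: skip clones whose sets differ

        # Highest score, then greatest length, then earliest in the input.
        my ($best) = sort {
               $b->{score}  <=> $a->{score}
            || $b->{length} <=> $a->{length}
            || $a->{order}  <=> $b->{order}
        } @amplicons;

        push @out, join "\t", $clone, @{$best}{qw(name length score)};
    }
    return @out;
}

# Usage, mirroring ./Program.pl Inputfile.txt
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print join("\t", qw(Name1 Name2 Length Score)), "\n";
    print "$_\n" for select_amplicons($fh);
    close $fh;
}
```

Because a plain hash does not preserve insertion order, the sketch records clone order in an array and amplicon order in a counter, so that "print the first Name2" on a full tie refers to the first one in the input file.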
