Reputation: 91
I have a table with five tab-delimited columns (shown at the end). Each "Name1" has one or more "Name2" entries, and each "Name2" has a "Length", a "Score", and one or more "Target(s)".
I have to select only one "Name2" for each "Name1", based on the following criteria.
(1) If there is only one "Name2", print "Name1", "Name2", "Length", and "Score".
(2) If there are two or more "Name2" entries for the same "Name1", compare their targets. If the set of targets is the same for all of them, print the "Name2" with the highest "Score". If the scores are equal, print the "Name2" with the greatest "Length". If both "Score" and "Length" are equal, print the first "Name2", together with "Name1", "Length", and "Score".
I have written a program that reads the whole table into hashes, but I cannot work out how to compare the targets of the different "Name2" entries of the same "Name1". I would appreciate it if anyone could help me complete the program. Thanks a lot. The program was executed as: ./Program.pl Inputfile.txt
(A) Input file
Name1 Name2 Length Score Target
sjj2_2RSSE.1 sjj2_2RSSE.1#II_15269285_15270181 897 3 WBGene00007064
sjj2_2RSSE.1 sjj2_2RSSE.1#II_15269295_15270191 897 4 WBGene00007064
sjj2_AC3.1 sjj2_AC3.1#V_10368996_10369727 732 3 WBGene00005532
sjj2_AC3.2 sjj2_AC3.2#V_10373256_10373988 733 3 WBGene00007070
sjj2_AC3.2 sjj2_AC3.2#V_10373256_10373988 733 3 WBGene00007028
sjj2_AC3.2 sjj2_AC3.2#V_10373256_10373988 733 3 WBGene00007019
sjj2_AC3.2 sjj2_AC3.2#V_10373256_10376356 3101 2 WBGene00007070
sjj2_AC3.2 sjj2_AC3.2#V_10373256_10376356 3101 2 WBGene00007028
sjj2_AC3.2 sjj2_AC3.2#V_10373256_10376356 3101 2 WBGene00007019
sjj2_AC3.6 sjj2_AC3.6#V_10393744_10394300 557 3 WBGene00000724
sjj2_AH10.1 sjj2_AH10.1#V_14146901_14148094 1194 4 WBGene00007082
sjj2_AH6.10 sjj2_AH6.10#II_9548665_9549674 1010 3 WBGene00003177
sjj2_AH6.10 sjj2_AH6.10#II_9548675_9549684 1010 2 WBGene00003177
(B) Expected Output
Name1 Name2 Length Score
sjj2_2RSSE.1 sjj2_2RSSE.1#II_15269295_15270191 897 4
sjj2_AC3.1 sjj2_AC3.1#V_10368996_10369727 732 3
sjj2_AC3.2 sjj2_AC3.2#V_10373256_10373988 733 3
sjj2_AC3.6 sjj2_AC3.6#V_10393744_10394300 557 3
sjj2_AH10.1 sjj2_AH10.1#V_14146901_14148094 1194 4
sjj2_AH6.10 sjj2_AH6.10#II_9548665_9549674 1010 3
Program.pl
#!/usr/bin/perl
use Data::Dumper;
%data=();
@arraysiRNA=();
$i=0;
$file1=$ARGV[0];
# three-argument open, and stop early if the file cannot be read
open(FP1, '<', $file1) or die "Cannot open $file1: $!";
while($siRNA=<FP1>)
{
    chomp($siRNA);
    @aa=split(/\t/,$siRNA);
    ($clone_id,$amplicon_id,$amplicon_length,$amplicon_evidence,$amplicon_target)=split /\t/,$siRNA;
    if(exists $data{$clone_id})
    {
        $data{$clone_id}{$amplicon_id}{amplicon_length}=$amplicon_length;
        $data{$clone_id}{$amplicon_id}{amplicon_evidence}=$amplicon_evidence;
        push( @{ $data{$clone_id}{$amplicon_id}{amplicon_target}}, $amplicon_target);
    }
    else
    {
        $data{$clone_id}{$amplicon_id}{amplicon_length}=$amplicon_length;
        $data{$clone_id}{$amplicon_id}{amplicon_evidence}=$amplicon_evidence;
        push( @{ $data{$clone_id}{$amplicon_id}{amplicon_target}}, $amplicon_target);
    }
    $i++;
}
#print Dumper(\%data);
foreach $Name1 (keys %data)
{
    foreach $Name2 (keys %{$data{$Name1}})
    {
        $len=$data{$Name1}{$Name2}{amplicon_length};
        $evid=$data{$Name1}{$Name2}{amplicon_evidence};
        @tar=@{$data{$Name1}{$Name2}{amplicon_target}};
        #select only unique targets
        @uniqueTar = do { my %seen; grep { !$seen{$_}++ } @tar };
        print "$Name1\t$Name2\tamplicon_length= $len\n";
        print "amplicon_evidence= $evid\n";
        print "amplicon_target= @tar\n";
        print "amplicon_target uniq= @uniqueTar\n";
    }
}
close FP1;
Upvotes: 1
Views: 83
Reputation: 426
The trick here is being able to quickly count the keys or array elements inside a nested hash. I use this pattern frequently:
scalar keys %{ $hash{$key} }
and
scalar @{ $hash{$key} }
return the number of keys and the number of elements, respectively. So if you want to see whether a hash key has only 1 subkey:
if (scalar keys %{ $hash{$key} } == 1) {
Using this pattern, you can now check for the cases you've defined.
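As a small self-contained illustration of the counting pattern, here is a sketch that uses a made-up `%hash` (the keys and values are hypothetical, not taken from the question's data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical nested data, only for illustration:
# two keys holding hashes of subkeys, one key holding an array.
my %hash = (
    clone_a => { amp1 => 1 },
    clone_b => { amp1 => 1, amp2 => 1 },
    targets => [ 'geneX', 'geneY', 'geneZ' ],
);

# keys in scalar context gives the number of subkeys
my $n_a = scalar keys %{ $hash{clone_a} };
my $n_b = scalar keys %{ $hash{clone_b} };

# an array in scalar context gives its element count
my $n_targets = scalar @{ $hash{targets} };

print "clone_a has exactly one subkey\n" if $n_a == 1;
print "clone_b subkeys: $n_b, targets: $n_targets\n";
```

The `%{ ... }` and `@{ ... }` wrappers dereference the inner hash or array reference before it is counted.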
You defined two cases:
clone has 1 amplicon
clone has multiple amplicons, all amplicons have same target
For case 2, add the following at lines 24 and 31 (inside the read loop, in both the if and the else branch, where $amplicon_target is in scope):
$data2{$clone_id}{'amplicons'}{$amplicon_target} = 1;
Now just add another loop to analyze (this replaces your bottom loop, after the Dump).
foreach my $clone (keys %data) {
    # case 1, clone only has 1 amplicon
    if (scalar keys %{ $data{$clone} } == 1) {
    } else {
        # case 2, clone has >1 amplicon but all have same target
        if (scalar keys %{ $data2{$clone}{'amplicons'} } == 1) {
        }
    }
}
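The loop above leaves both branches empty. One possible way to fill them in is sketched below, under some assumptions. The names `@clone_order`, `%amp_order`, `@selected`, and `target_key` are hypothetical helpers, not part of the original program, and the rows are a hard-coded subset of the question's table rather than a file read. Instead of counting a clone's distinct targets, the sketch compares each amplicon's whole set of targets, because a clone such as sjj2_AC3.2 has several targets that are shared by all of its amplicons. Input order is recorded explicitly, since `keys` returns hash keys in no particular order and the tie-break rule asks for the first "Name2". Clones whose amplicons have different target sets are skipped, because the question does not say what to do with them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows in input order: Name1, Name2, Length, Score, Target.
# A subset of the question's table; a real run would read them from the file.
my @rows = (
    [ 'sjj2_2RSSE.1', 'sjj2_2RSSE.1#II_15269285_15270181', 897,  3, 'WBGene00007064' ],
    [ 'sjj2_2RSSE.1', 'sjj2_2RSSE.1#II_15269295_15270191', 897,  4, 'WBGene00007064' ],
    [ 'sjj2_AC3.2',   'sjj2_AC3.2#V_10373256_10373988',    733,  3, 'WBGene00007070' ],
    [ 'sjj2_AC3.2',   'sjj2_AC3.2#V_10373256_10373988',    733,  3, 'WBGene00007028' ],
    [ 'sjj2_AC3.2',   'sjj2_AC3.2#V_10373256_10376356',    3101, 2, 'WBGene00007070' ],
    [ 'sjj2_AC3.2',   'sjj2_AC3.2#V_10373256_10376356',    3101, 2, 'WBGene00007028' ],
);

my (%data, @clone_order, %amp_order);
for my $row (@rows) {
    my ($clone, $amp, $len, $score, $target) = @$row;
    # remember first-seen order of clones and of amplicons within a clone
    push @clone_order, $clone unless exists $data{$clone};
    push @{ $amp_order{$clone} }, $amp unless exists $data{$clone}{$amp};
    $data{$clone}{$amp}{length} = $len;
    $data{$clone}{$amp}{score}  = $score;
    $data{$clone}{$amp}{targets}{$target} = 1;    # hash keys act as a set
}

# Canonical string for an amplicon's target set, so two sets compare with eq.
sub target_key {
    my ($amp_info) = @_;
    return join ',', sort keys %{ $amp_info->{targets} };
}

my @selected;
for my $clone (@clone_order) {
    my @amps = @{ $amp_order{$clone} };

    # Collect the distinct target sets among this clone's amplicons.
    my %distinct_sets;
    $distinct_sets{ target_key($data{$clone}{$_}) } = 1 for @amps;

    # Assumption: differing target sets are outside the stated criteria.
    next if keys %distinct_sets > 1;

    # Highest score wins, then greatest length; ties keep the first amplicon.
    my $best = $amps[0];
    for my $amp (@amps[1 .. $#amps]) {
        my ($cand, $cur) = ($data{$clone}{$amp}, $data{$clone}{$best});
        if (   $cand->{score} > $cur->{score}
            || ($cand->{score} == $cur->{score} && $cand->{length} > $cur->{length}) ) {
            $best = $amp;
        }
    }
    push @selected, join "\t", $clone, $best,
        $data{$clone}{$best}{length}, $data{$clone}{$best}{score};
}

print "$_\n" for @selected;
```

A clone with a single amplicon passes through the same code path: it has exactly one target set and the comparison loop runs zero times, so case 1 needs no separate branch in this sketch.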
Upvotes: 1