Reputation: 153
I am parsing an output report from psiblast. I used COG alignments and searched a gene database for matches (homologues). One thing that I want to do is to find out which genes match to more than one COG. My partial script is below.
I am specifically having problems creating an array that holds all of the COGs for the genes that are assigned to multiple COGs.
I am getting the following error "Can't use string ("COG0003") as an ARRAY ref while "strict refs" in use at parse_POG_reports.pl line 26, line 67.".
I have looked at other posted relating to pushing elements into hashes of arrays. But I think the error might be occurring when one gene has 2 matches to the same COG, and it tries to push the same COG into the array (ie. the last 2 lines of the sample input). Does this make sense? If so, how can I avoid this problem?
use strict;
use warnings;
my %maxBits;my %COGhit_count;
my $Hohits={};my %COGhits;
my $COG_psi_report=$ARGV[0];
open (IN, $COG_psi_report) or die "cannot open $COG_psi_report\n";
while (my $line=<IN>){
next if ($line =~/^#/);
chomp $line;
my @columns = split(/\t/,$line);
my $bits=$columns[11];
my $COG=$columns[0];
my $hit=$columns[1];
my $Eval=$columns[10];
next if ($Eval > 0.00001); # threshold for significant hits set by DK
$COGhit_count{$hit}++; # count how many COGs each gene is homologous to
$COGhits{$hit}=$COG;
if ($COGhit_count{$hit}>1) {
push @{$COGhits{$hit}}, $COG; #
}
## for those that there are multiple hits we need to select top hit ##
if (!exists $maxBits{$hit}){
$maxBits{$hit}=$bits;
}
elsif (exists $maxBits{$hit} && $bits > $maxBits{$hit}){
$maxBits{$hit}=$bits;
}
$Hohits->{$hit}->{$bits}=$COG;
}
close (IN);
example Input:
POG0002 764184357-stool1_revised_scaffold22981_1_gene47608 23.90 159 112 3 1 156 1 153 2e-06 54.2
POG0002 764062976-stool2_revised_C999233_1_gene54902 23.63 182 121 5 3 169 2 180 2e-06 53.9
POG0002 763901136-stool1_revised_scaffold39447_1_gene145241 26.45 155 89 3 3 137 5 154 3e-06 53.9
POG0002 765701615-stool1_revised_C1349270_1_gene168522 23.53 187 115 5 3 169 2 180 5e-06 53.1
POG0002 158802708-stool2_revised_C1077267_1_gene26470 22.69 216 158 5 3 213 5 216 5e-06 52.7
POG0003 160502038-stool1_revised_scaffold47906_2_gene161164 33.00 297 154 6 169 424 334 626 6e-40 157
POG0003 160502038-stool1_revised_scaffold47906_2_gene161164 16.28 172 128 4 23 192 46 203 1e-06 56.6
POG0003 158337416-stool1_revised_C1254444_1_gene13533 30.06 346 184 7 133 424 57 398 6e-40 155
POG0003 158337416-stool1_revised_scaffold29713_1_gene153054 28.61 332 194 8 132 424 272 599 2e-38 152
POG0003 158337416-stool1_revised_scaffold29713_1_gene153054 24.00 200 131 5 1 193 5 190 9e-11 69.3
Upvotes: 2
Views: 589
Reputation: 29854
It's telling you exactly what you're doing wrong.
$COGhits{$hit}=$COG; # <--- scalar
if ($COGhit_count{$hit}>1) {
push @{$COGhits{$hit}}, $COG; # <--- array
}
You can't assign the value as a non-reference type and then try to autovivify it as a reference type. Perl will do the latter, but not if you've already stored a conflicting data type in that location.
Also, if this by some miracle worked the first time (it won't), and you ran this more than once, any array that you might have autoviv-ed by the push, would be clobbered by the scalar non-reference assignment.
I'm not sure what you're after, but the first line should probably be deleted.
Instead of that construct, you want to decide whether there will ever be more than one specification of $COG
for a value of $hit
. If there can be, simply replacing those 4 lines with the push
is the way to go.
I have done multi-purpose structure slots before, and they are largely a pain to maintain. But if you wanted to do something like that, you can do this:
my $ref = \$hashref->{ $key }; # autovivifies slot as simple scalar.
# it starts out as undefined.
if ( ref $$ref ) { # ref $ref will always be true
push @$$ref, $value;
}
else {
$$ref = defined( $$ref ) ? [ $$ref, $value ] : $value;
}
But you have to write bifurcated logic every time you want to access the mixed tree in some different way. The performance savings that you get with scalars, is somewhat eaten up by the tests and branching.
So I don't do too much of this anymore. I decide beforehand whether the relationship is 1-1 or 1-n. Routines like the ones below can make it more straightforward dealing with these kinds of tables, to a degree.
sub get_list_from_hash {
my ( $hash, $key ) = @_;
my $ref = \$hash->{ $key };
return unless defined( $$ref );
return ref( $$ref ) ? @$$ref : $$ref;
}
sub store_in_hash {
$_[0] = {} unless ref $_[0];
my ( $hash, $key, @values ) = @_;
my @defined = grep {; defined } @values;
unless ( @defined ) {
delete $hash->{ $key };
return;
}
my $ref = \$hash->{ $key };
if ( ref $$ref ) {
push @$$ref, @defined;
}
elsif ( defined $$ref ) {
$$ref = [ $$ref, @defined ];
}
elsif ( @values > 1 ) {
@$$ref = @defined;
}
else {
( $$ref ) = @defined;
}
}
Upvotes: 0
Reputation: 54323
You need to get rid of line 24 (counting backwards):
$COGhits{$hit}=$COG;
In it, you are setting $COGhits{$hit}
to a scalar value (the value of $COG
). Later, in line 26 you are trying to dereference $COGhits{$hit}
as an array to push into it. That doesn't work because there's a scalar in there.
Just remove the if
and change those lines into this. That should do the trick as now all those $hit
s are stored in array references.
$COGhit_count{$hit}++; # count how many COGs each gene is homologous to
push @{$COGhits{$hit}}, $COG;
Output of $COGhits
:
$VAR4 = {
'158802708-stool2_revised_C1077267_1_gene26470' => [
'POG0002'
],
'764062976-stool2_revised_C999233_1_gene54902' => [
'POG0002'
],
'764184357-stool1_revised_scaffold22981_1_gene47608' => [
'POG0002'
],
'765701615-stool1_revised_C1349270_1_gene168522' => [
'POG0002'
],
'763901136-stool1_revised_scaffold39447_1_gene145241' => [
'POG0002'
],
'160502038-stool1_revised_scaffold47906_2_gene161164' => [
'POG0003',
'POG0003'
]
};
If you however want both the scalar and the array ref, try this code. I don't recommend this, though.
$COGhit_count{$hit}++; # count how many COGs each gene is homologous to
if ($COGhit_count{$hit} == 1) {
$COGhits{$hit}=$COG; # Save as scalar
}
elsif ($COGhit_count{$hit} == 2) { # If we've just found the second hit,
my $temp = $COGhits{$hit}; # save the first and convert $COGhits{$hit}
$COGhits{$hit} = []; # to an array ref, then push both the old and
push @{$COGhits{$hit}}, $temp, $COG; # the new value in it.
} elsif ($COGhit_count{$hit} > 2) {
push @{$COGhits{$hit}}, $COG; # Just push the new value in
}
Thought: You probably had $COGhits{$hit}=$COG
first but then noticed that sometimes there can be more than one value, so you added the push
line, but you did not realized that you in fact had to replace the old line.
Upvotes: 1