Reputation: 79
I have a large dataset (29 columns by 19000 rows) and I want to be able to compare values on each line and print a descriptive output.
Specifically, I want to query the value in column A (@WTcall) which is effectively a pass/fail statement. If the data fails I want to print the 'fail statement' and move on to the next line, but if the data passes I want to continue describing the data.
The next questions are to determine which classification the data in columns X (@positive) and Y (@negative) fall into:
(Ex:
If column X and column Y >= 0.6 then print "ABC"
If column X and column Y < 0.6 then print "CBA"
If column X >= 0.6 but column Y is < 0.6 print "DEF"
If column X < 0.6 but column Y is >= 0.6 print "FED"
else print "missing data". )
I have included the code that I wrote below as well as a subset of sample data.
The tests I ran before posting are commented out in the code. Briefly, if I comment out the list of 'if and elsif statements', print "@WTcall\t@positive\t@negative\n" and pipe it through a head command - my variables appear to be pulling out the correct information.
The trouble arises in the actual comparisons as every line gets classified with the "Methylated\tMethylated\n" description. It's not clear to me why this is. I know I have entries where the @WTcall column should match $BadPosition (the pass/fail check). Further, if I comment out the 'if statements' again, print "@WTcall\n$BadPosition" and pipe it through sort and uniq - I only get one value for "No_WT_Concensus" and so there should be no typo there or problems matching these values.
I'm sure the issue is obvious and staring me right in the face, so I really appreciate any help.
Thank you.
Code:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
die "Usage: $0 Filename\n" if @ARGV != 1;
my $file = shift;
my @line;
my @positive;
my @negative;
my @WTcall;
my $BadPosition = 'No_WT_Concensus';
my $i;
open my $infh, '<', $file;
while (<$infh>) {
chomp;
@line = split(/\t/,$_);
$WTcall[0]=$line[0];
$positive[0]=$line[14];
$negative[0]=$line[29];
#foreach $1 (<$infh>) {
foreach $1 (@WTcall) {
if (@WTcall eq $BadPosition){
print "No_WT_Concensus\tNo_WT_Concensus\n";
} elsif (@positive >= 0.6 && @negative >= 0.6){
print "Methylated\tMethylated\n";
} elsif (@positive <= 0.6 && @negative <= 0.6){
print "Under-methylated\tUnder-methylated\n";
} elsif(@positive >= 0.6 && @negative <=0.6){
print "Hemimethylated (m6A)\tHemimethylated (A)\n";
} elsif(@positive <= 0.6 && @negative >= 0.6){
print "Hemimethylated (A)\tHemimethylated (m6A)\n";
} else{
print "Missing_Site\tMissing_Site\n";
}
#print "@WTcall\n$BadPosition\n";
#print "@WTcall\t@positive\t@negative\n"
#print "@negative\n";
}
}
close $infh;
Sample Data:
Methylated coding gene 619 thrA NC_000913.3 pos 3 1 0.9535 1 NC_000913.3 619 + 18 0.8889 Methylated coding gene 620 thrA NC_000913.3 neg 3 0.9429 0.9756 0.9714 NC_000913.3 620 - 14 1
No_WT_Concensus coding gene 195410 ispU NC_000913.3 pos 2 0.5789 0.766 0.6071 NC_000913.3 195410 + 39 0.5897 Methylated coding gene 195411 ispU NC_000913.3 neg 3 0.75 0.9074 0.9306 NC_000913.3 195411 - 21 0.8095
Under-methylated pseudogene 3632965 yhiL NC_000913.3 pos 0 0.0323 0.1429 0.0962 NC_000913.3 3632965 + 22 0.0909 Under-methylated pseudogene 3632966 yhiL NC_000913.3 neg 0 0.1463 0.175 0.1429 NC_000913.3 3632966 - 23 0
Methylated intergenic 164636 hrpB-mrcB NC_000913.3 pos 3 0.7381 0.7647 0.7273 NC_000913.3 164636 + 25 0.8 Methylated intergenic 164637 hrpB-mrcB NC_000913.3 neg 3 0.7 0.7931 0.7213 NC_000913.3 164637 - 25 0.4
Methylated intergenic 269287 ykfA-perR NC_000913.3 pos 3 0.875 0.8833 0.931 NC_000913.3 269287 + 22 0.8182 Methylated intergenic 269288 ykfA-perR NC_000913.3 neg 3 0.8077 0.6866 0.6491 NC_000913.3 269288 - 17 0.5294
Methylated coding gene 4397856 mutL NC_000913.3 pos 3 0.9245 0.9831 0.9661 blank blank blank blank blank Methylated coding gene 4397857 mutL NC_000913.3 neg 3 0.9783 0.9808 0.9683 NC_000913.3 4397857 - 1 0
Methylated coding gene 4397969 mutL NC_000913.3 pos 3 0.9643 0.9524 1 blank blank blank blank blank Methylated coding gene 4397970 mutL NC_000913.3 neg 3 1 1 1 blank blank blank blank blank
Methylated coding gene 2761 thrA NC_000913.3 pos 3 0.9259 0.8654 0.9242 NC_000913.3 2761 + 31 1 Methylated coding gene 2762 thrA NC_000913.3 neg 3 0.913 0.9636 0.9767 NC_000913.3 2762 - 29 0.9655
Methylated coding gene 3073 thrB NC_000913.3 pos 3 0.9677 0.8983 1 NC_000913.3 3073 + 29 1 Methylated coding gene 3074 thrB NC_000913.3 neg 3 1 0.9038 0.9778 NC_000913.3 3074 - 31 1
Upvotes: 2
Views: 755
Reputation: 132730
Here are your requirements, edited to show parallel structure:
X >= 0.6 and Y >= 0.6 then "ABC"
X < 0.6 and Y < 0.6 then "CBA"
X >= 0.6 but Y < 0.6 then "DEF"
X < 0.6 but Y >= 0.6 then "FED"
Some of the requirements are < 0.6
, but in your code you have <= 0.6
.
You have two things to test and you should first look for a structure that tests each one only once. Here's so pseudocode that expresses that:
if X >= 0.6
if Y >= 0.6
"ABC"
else
"DEF"
else
if Y >= 0.6
"FED"
else
"CBA"
Once you know if a value is greater than or equal to a value, you also know if it is less than, so you don't have to test again. The test simply bifurcates; if you don't take one branch you must take the other.
Your code is a bit different because it matches both $x >= 0.6
or $x <= 0.6
, and the same for $y
. This means that if both $x
and $y
are 0.6
, then any brnach can match and you'll get the first one in the chain. That seems like an error in the code because it's not what you described in the text.
I've had to do many projects that had long lists of these sorts of selections, many with hundreds of things to test.
Now, the trick is to turn that into Perl. Remember that a subroutine returns the last evaluated expression, so this works:
my @x_y = (
[ 0.1, 0.7 ],
[ 0.1, 0.1 ],
[ 0.7, 0.1 ],
[ 0.7, 0.7 ]
);
foreach my $x_y ( @x_y ) {
printf "X: %.1f Y: %.1f --> %s\n", @$x_y, get_value( @$x_y );
}
sub get_value {
my( $x, $y ) = @_;
if( $x >= 0.6 ) { $y >= 0.6 ? 'ABC' : 'DEF' }
else { $y >= 0.6 ? 'FED' : 'CBA' }
}
I might even go so far as to parameterize the pivot value and give it a value if I don't pass one:
sub get_value {
my( $x, $y, $pivot ) = @_;
$pivot //= 0.6; # default value
if( $x >= $pivot ) { $y >= $pivot ? 'ABC' : 'DEF' }
else { $y >= $pivot ? 'FED' : 'CBA' }
}
With (experimental) subroutine signatures, that's a bit cleaner because I can set the default:
use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);
sub get_value ( $x, $y, $pivot = 0.6 ){
if( $x >= $pivot ) { $y >= $pivot ? 'ABC' : 'DEF' }
else { $y >= $pivot ? 'FED' : 'CBA' }
}
But things can get more interesting. I've repeated that test for $y
, but I can save the result of a comparison:
use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);
sub get_value ( $x, $y, $pivot = 0.6 ){
my $boolean = ($y >= $pivot);
if( $x >= $pivot ) { $boolean ? 'ABC' : 'DEF' }
else { $boolean ? 'FED' : 'CBA' }
}
But, what am I really doing here? I'm trying to select a value. I've represented this as a decision tree in code. What if I could flip that around? I can do the same saved result trick with $x
:
use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);
sub get_value ( $x, $y, $pivot = 0.6 ){
my $y_boolean = ($y >= $pivot);
my $x_boolean = ($x >= $pivot);
if( $x_boolean ) { $y_boolean ? 'ABC' : 'DEF' }
else { $y_boolean ? 'FED' : 'CBA' }
}
So now I have a situation where I have the boolean combinations (0,0), (0,1), (1,0), and (1,1). I'll use those as array indices. The state
makes a persistent varaible so I don't need to redefine it every time I call the subroutine. Perl v5.28 allows you to initialize arrays and hashes in state
, and with earlier versions you just use a reference:
use v5.28;
use feature qw(signatures);
no warnings qw(experimental::signatures);
sub get_value ( $x, $y, $pivot = 0.6 ) {
state @table = (
[ qw(CBA FED) ],
[ qw(DEF ABC) ]
);
my $y_boolean = ($y >= $pivot);
my $x_boolean = ($x >= $pivot);
$table[$x_boolean][$y_boolean];
}
Or, slightly more compact, I can put the comparisons in the array indices:
use v5.28;
use feature qw(signatures);
no warnings qw(experimental::signatures);
sub get_value ( $x, $y, $pivot = 0.6 ){
state @table = (
[ qw(CBA FED) ],
[ qw(DEF ABC) ]
);
$table[$x >= $pivot][$y >= $pivot];
}
Now the values are separated from the mechanics—something I spend a lot of time on in Mastering Perl. That can be a parameter too, although now it has to be an array ref because an array paramater cannot have a default value:
use v5.28;
use feature qw(signatures);
no warnings qw(experimental::signatures);
my @x_y = (
[ 0.1, 0.7 ],
[ 0.1, 0.1 ],
[ 0.7, 0.1 ],
[ 0.7, 0.7 ]
);
my $pivot = 0.6;
my @table = ( [ qw(CBA FED) ], [ qw(DEF ABC) ] );
foreach my $x_y ( @x_y ) {
printf "X: %.1f Y: %.1f --> %s\n", @$x_y,
get_value( @$x_y, $pivot, \@table );
}
sub get_value ( $x, $y, $pivot = 0.6,
@table = ([ qw(DEF FED) ], [ qw(ABC CBA) ]) )
{
$table[$x >= $pivot][$y >= $pivot];
}
This is much easier to manage. You can adjust the pivot and the values you get back while the mechanics don't change.
Going one step further takes the values out of the program completely and puts them in a configuration file. The same program can then handle many other situations without you having to edit it.
Back to your code. choroba shows you this, which fixes some of the problems but leaves the <=
issue in tact:
while (<$infh>) {
chomp;
my @columns = split /\t/;
my ($wt_call, $positive, $negative) = @columns[0, 14, 29];
if ($wt_call eq $BadPosition) {
print "No_WT_Concensus\tNo_WT_Concensus\n";
} elsif ($positive >= 0.6 && $negative >= 0.6) {
print "Methylated\tMethylated\n";
} elsif ($positive <= 0.6 && $negative <= 0.6) {
print "Under-methylated\tUnder-methylated\n";
} elsif ($positive >= 0.6 && $negative <=0.6) {
print "Hemimethylated (m6A)\tHemimethylated (A)\n";
} elsif ($positive <= 0.6 && $negative >= 0.6) {
print "Hemimethylated (A)\tHemimethylated (m6A)\n";
} else {
print "Missing_Site\tMissing_Site\n";
}
}
Cleaned up a bit, you have a while
that controls the parts dealing with each line, and a subroutine that handles the parts for control the values.
use v5.28;
use feature qw(signatures);
no warnings qw(experimental::signatures);
while( <$infh> ) {
chomp;
my( $wt_call, $positive, $negative ) = (split /\t/)[0,14,29];
if( $wt_call eq ... ) {
print "No_WT_Concensus\tNo_WT_Concensus\n";
next;
}
say get_value( $positive, $negative );
}
sub get_value ( $x, $y, $pivot = 0.6 ){
state @table = (
[ qw(CBA FED) ],
[ qw(DEF ABC) ]
);
$table[$x >= $pivot][$y >= $pivot];
}
Note that the else
condition is unreachable since you've covered all the cases already, so I omit that. If there's another situation where you want to look at empty fields (null versus 0), you'd capture that before get_value
. One way is to look at the length of the field. If it's 0 (no characters), don't count it. You can have 0, 1, or 2 empty fields, and those might be different cases:
while( <$infh> ) {
chomp;
my( $wt_call, $positive, $negative ) = (split /\t/)[0,14,29];
if( $wt_call eq ... ) {
print "No_WT_Concensus\tNo_WT_Concensus\n";
next;
}
# what if this is 1?
unless( 2 == grep { length } ($positive, $negative) ) {
print "No_WT_Concensus\tNo_WT_Concensus\n";
next;
}
say get_value( $positive, $negative );
}
Upvotes: 7
Reputation: 241748
Variables starting with the @
sigil are arrays. When comparing an array, you're imposing a numeric scalar context on it, so it returns its size.
You don't need arrays for single values, just use scalars.
Don't use the special variable $1
as a loop variable. It's confusing and cancels its special behaviour.
Here's how I'd rewrite your program. It still complains about comparing "blank" to a number, but I'm not sure what you want to do with those values.
#!/usr/bin/perl
use strict;
use warnings;
die "Usage: $0 Filename\n" if @ARGV != 1;
my $file = shift;
my $BadPosition = 'No_WT_Concensus';
open my $infh, '<', $file or die "$file: $!";
while (<$infh>) {
chomp;
my @columns = split /\t/;
my ($wt_call, $positive, $negative) = @columns[0, 14, 29];
if ($wt_call eq $BadPosition) {
print "No_WT_Concensus\tNo_WT_Concensus\n";
} elsif ($positive >= 0.6 && $negative >= 0.6) {
print "Methylated\tMethylated\n";
} elsif ($positive <= 0.6 && $negative <= 0.6) {
print "Under-methylated\tUnder-methylated\n";
} elsif ($positive >= 0.6 && $negative <=0.6) {
print "Hemimethylated (m6A)\tHemimethylated (A)\n";
} elsif ($positive <= 0.6 && $negative >= 0.6) {
print "Hemimethylated (A)\tHemimethylated (m6A)\n";
} else {
print "Missing_Site\tMissing_Site\n";
}
}
Upvotes: 4