Reputation: 409
I'm having issues with a script I've written. I've broken it down to identify the problem and here it is.
Input file (tab delimited):
FORMAT Sample1 Sample2 Sample3
GT:AD:DP:GQ:PL 0/1:17,6:23:85:85,0,370 0/0:51,6:57:17:0,17,1359 0/0:3,0:3:9:0,9,99
GT:AD:DP:GQ:PGT:PID:PL 0/0:3,0:3:0:.:.:0,0,38 0/0:1,0:1:3:.:.:0,3,33 0/1:1,2:3:26:0|1:13813_T_G:81,0,26
GT:AD:DP:GQ:PGT:PID:PL ./.:2,0:2:.:.:.:0,0,0 0/0:1,0:1:3:.:.:0,3,33 0/1:1,2:3:26:0|1:13813_T_G:81,0,26
GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 1/1:0,4:4:12:131,12,0 ./.:0,0:0:.:0,0,0
GT:AD:DP:GQ:PGT:PID:PL 1/1:0,2:2:6:1|1:14590_G_A:90,6,0 0/0:3,0:3:9:.:.:0,9,98 0/0:1,0:1:3:.:.:0,3,30
GT:AD:DP:GQ:PGT:PID:PL 1/1:0,2:2:6:1|1:14590_G_A:90,6,0 0/0:3,0:3:9:.:.:0,9,98 0/0:1,0:1:3:.:.:0,3,30
GT:AD:DP:GQ:PGT:PID:PL 1/1:0,2:2:6:1|1:14590_G_A:90,6,0 0/0:3,0:3:9:.:.:0,9,98 0/0:1,0:1:3:.:.:0,3,30
GT:AD:DP:GQ:PGT:PID:PL 1/1:0,2:2:6:1|1:14590_G_A:90,6,0 0/0:2,0:2:6:.:.:0,6,72 0/0:1,0:1:3:.:.:0,3,30
GT:AD:DP:GQ:PL 1/1:0,7:7:21:186,21,0 0/1:5,4:9:79:79,0,103 ./.:1,0:1:.:0,0,0
Desired output (take the first 3 characters before the colon from each sample) per line and print each line:
GT:AD:DP:GQ:PL 0/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL 0/0 0/0 0/1
GT:AD:DP:GQ:PGT:PID:PL ./. 0/0 0/1
GT:AD:DP:GQ:PL ./. 1/1 ./.
GT:AD:DP:GQ:PGT:PID:PL 1/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL 1/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL 1/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL 1/1 0/0 0/0
GT:AD:DP:GQ:PL 1/1 0/1 ./.
The code I'm using to do this step is not producing the correct 0/0, 0/1, 0/2 codes as expected per line. I think it is an issue with how I've written the for loop but I'm not sure.
#!/usr/bin/perl
use strict;
my $inputfile1 = $ARGV[0];
open (FILE1, $inputfile1) or die "Uh oh.. unable to find file $inputfile1"; ##Opens input file
my @file1 = <FILE1>; #loads inputfile1 data into array
close FILE1;
my (@colsplit, @genotypes1, @genotypes2, @genotypes3, @joined);
foreach my $line(@file1) { ## process each line, splitting columns and move onto next line
@colsplit = split("\t", $line);
push (@joined, $colsplit[0]);
foreach my $lines(@colsplit) {
if ($colsplit[1] =~ m/(^0\/1)/ || $colsplit[1] =~ m/(^0\/0)/ || $colsplit[1]=~ m/(^1\/0)/ || $colsplit[1] =~ m/(^1\/1)/ || $colsplit[1] =~ m/(^.\/.)/) {
push (@genotypes1, $1);
}
if ($colsplit[2] =~ m/(^0\/1)/ || $colsplit[2] =~ m/(^0\/0)/ || $colsplit[2] =~ m/(^1\/0)/ || $colsplit[2] =~ m/(^1\/1)/ || $colsplit[2] =~ m/(^.\/.)/) {
push (@genotypes2, $1);
}
if ($colsplit[3] =~ m/(^0\/1)/ || $colsplit[3] =~ m/(^0\/0)/ || $colsplit[3] =~ m/(^1\/0)/ || $colsplit[3] =~ m/(^1\/1)/ || $colsplit[3] =~ m/(^.\/.)/) {
push (@genotypes3, $1);
}
}
}
my $i = 0;
foreach my $line(@joined) {
if ($line =~ m/GT/) {
print "$line\t$genotypes1[$i]\t$genotypes2[$i]\t$genotypes3[$i]\n";
$i++;
}}
I think the issue may be that once the first sample1 column is matched, it jumps to the second line rather than iterating through the second sample2 column. I can't see how I'm messing this up otherwise! It's driving me nuts!
My current output is:
GT:AD:DP:GQ:PL 0/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL 0/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL 0/1 0/0 0/0
GT:AD:DP:GQ:PL 0/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL 0/0 0/0 0/1
GT:AD:DP:GQ:PGT:PID:PL 0/0 0/0 0/1
GT:AD:DP:GQ:PGT:PID:PL 0/0 0/0 0/1
GT:AD:DP:GQ:PGT:PID:PL 0/0 0/0 0/1
GT:AD:DP:GQ:PL ./. 0/0 0/1
This is clearly not what I want!
Any help would be gratefully received.
Ps. I'm new to this so go easy.
Upvotes: 1
Views: 69
Reputation: 37472
One possible way is to use the continuation modifier (c
) of regular expression matching. It causes a matching operation to start from where the last one found a match. Like that, each field of a line can be processed.
#!/usr/bin/perl
use strict;
use warnings;
my $fh;
$ARGV[0] && open($fh, $ARGV[0]) || die();
# discard first line;
<$fh>;
foreach my $line (<$fh>) {
chomp($line);
# capture every non tab character from the beginning of the line;
# globally match the pattern repeatedly in the string (g modifier)
# keep the current position during repeated matching (c modifier);
$line =~ m/^([^\t]*)/gc;
print("$1");
# capture every non colon character after a tab character;
# globally match the pattern repeatedly in the string (g modifier);
# keep the current position during repeated matching (c modifier);
while ($line =~ m/\t([^:]*)/gc) {
print("\t$1");
}
print("\n");
}
Upvotes: 0
Reputation: 409
I believe I have fixed the issue. I removed foreach my $lines(@colsplit) {
and now it works! Code below:
#!/usr/bin/perl
use strict;
my $inputfile1 = $ARGV[0];
open (FILE1, $inputfile1) or die "Uh oh.. unable to find file $inputfile1"; ##Opens input file
my @file1 = <FILE1>; #loads inputfile1 data into array
close FILE1;
my (@colsplit, @genotypes1, @genotypes2, @genotypes3, @joined);
foreach my $line(@file1) { ## process each line, splitting columns and move onto next line
@colsplit = split("\t", $line);
push (@joined, $colsplit[0]);
if ($colsplit[1] =~ m/(^0\/1)/ || $colsplit[1] =~ m/(^0\/0)/ || $colsplit[1]=~ m/(^1\/0)/ || $colsplit[1] =~ m/(^1\/1)/ || $colsplit[1] =~ m/(^.\/.)/) {
push (@genotypes1, $1);
}
if ($colsplit[2] =~ m/(^0\/1)/ || $colsplit[2] =~ m/(^0\/0)/ || $colsplit[2] =~ m/(^1\/0)/ || $colsplit[2] =~ m/(^1\/1)/ || $colsplit[2] =~ m/(^.\/.)/) {
push (@genotypes2, $1);
}
if ($colsplit[3] =~ m/(^0\/1)/ || $colsplit[3] =~ m/(^0\/0)/ || $colsplit[3] =~ m/(^1\/0)/ || $colsplit[3] =~ m/(^1\/1)/ || $colsplit[3] =~ m/(^.\/.)/) {
push (@genotypes3, $1);
}
}
my $i = 0;
foreach my $line(@joined) {
if ($line =~ m/GT/) {
print "$line\t$genotypes1[$i]\t$genotypes2[$i]\t$genotypes3[$i]\n";
$i++;
}}
Thanks to @zdim especially for showing me a more elegant solution. E
Upvotes: 0
Reputation: 66883
The question seems clear, to keep data before :
in each column (except in the first). But then I am a little confused by the attempted code, which needlessly goes through specific patterns for columns.
Here is a simple take on what is described
use warnings;
use strict;
use feature 'say';
my $header_line = <>; # drop the first line
while (<>) { # line by line from files given on cmdline, or STDIN
chomp;
my ($fmt_col, @cols) = split /\t/; # (but sample has spaces, not tabs)
s/(.*?):.*/$1/ for @cols; # keep up to (first) : in each field
say join "\t", $fmt_col, @cols; # print fields joined by tabs
}
The <>
reads all lines from all files given on the command-line, or STDIN
; so submit a filename to process on the command line when running this program.
Note that the posted sample doesn't have tabs, but rather spaces; so the above code will fail if one copy-pastes it. Either change spaces to tabs to test, or change split /\t/;
to split;
(so to use its default to split on, which is any amount of any whitespace).
All fields other than the first are changed so that only characters up to the first :
stay.
This is done using the fact that each item processed in the foreach
loop ("topicalizer") is aliased to the currently processed element. So when the regex s///
changes it then the corresponding element of @cols
is changed. If this seems too cooky to stomach please by all means write it out nice and slow.
If something else is actually needed to do please clarify.
Upvotes: 2