Reputation: 1
I have a file with 16 different columns (tab-separated values):
22 51169729 G 39 A 0 0 C 0 0 G 38 0.974359 T 1 0.025641
22 51169730 A 36 A 36 1 C 0 0 G 0 0 T 0 0
22 51169731 C 39 A 0 0 C 39 1 G 0 0 T 0 0
22 51169732 G 37 A 0 0 C 0 0 G 37 1 T 0 0
22 51169733 G 33 A 0 0 C 0 0 G 33 1 T 0 0
22 51169734 C 35 A 0 0 C 35 1 G 0 0 T 0 0
22 51169735 A 32 A 32 1 C 0 0 G 0 0 T 0 0
22 51169736 G 32 A 0 0 C 0 0 G 32 1 T 0 0
22 51169737 C 30 A 0 0 C 30 1 G 0 0 T 0 0
22 51169738 T 27 A 0 0 C 0 0 G 0 0 T 27 1
22 51169739 G 26 A 0 0 C 0 0 G 26 1 T 0 0
22 51169740 A 25 A 25 1 C 0 0 G 0 0 T 0 0
22 51169741 C 22 A 0 0 C 22 1 G 0 0 T 0 0
22 51169742 G 23 A 0 0 C 0 0 G 23 1 T 0 0
22 51169743 C 21 A 0 0 C 21 1 G 0 0 T 0 0
22 51169744 C 22 A 0 0 C 22 1 G 0 0 T 0 0
22 51169745 C 19 A 0 0 C 19 1 G 0 0 T 0 0
22 51169746 C 19 A 0 0 C 19 1 G 0 0 T 0 0
22 51169747 A 15 A 14 0.933333 C 1 0.0666667 G 0 0 T 0 0
22 51169748 C 20 A 0 0 C 20 1 G 0 0 T 0 0
The third column can be A, G, C or T.
I would like to:
When this is done for the entire file, there would only be 4 columns left in some cases and 7 columns in other cases, like in the following example:
22 51169729 G 39 T 1 0.025641
22 51169730 A 36
22 51169731 C 39
22 51169732 G 37
22 51169733 G 33
22 51169734 C 35
22 51169735 A 32
22 51169736 G 32
22 51169737 C 30
22 51169738 T 27
22 51169739 G 26
22 51169740 A 25
22 51169741 C 22
22 51169742 G 23
22 51169743 C 21
22 51169744 C 22
22 51169745 C 19
22 51169746 C 19
22 51169747 A 15 C 2 0.133333
22 51169748 C 20
Any suggestions?
Upvotes: 0
Views: 114
Reputation: 204731
Here's one way to do the first part, assuming no empty fields:
$ cat tst.awk
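$3 == "A" { $5=$6=$7="" }       # blank the three columns that belong to the base in column 3
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
{ sub(/[[:space:]]+$/,""); gsub(/[[:space:]]+/,"\t"); print }   # trim the tail, then collapse each gap into one tab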
$ awk -f tst.awk file
1 957584 C 157 A 1 0.006 G 0 0 T 0 0
I don't really understand what you're trying to do in the second part, but this might be what you want if the test on $7, $10 and $13 refers to the field numbers as they are after the first phase has removed columns:
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
{ $0=$0 }                       # re-split the record so the remaining fields are renumbered
$7 ~ /0/ { c++ }
$10 ~ /0/ { c++ }
$13 ~ /0/ { c++ }
c > 1 { $8=$9=$10="" }
{ c=0; sub(/[[:space:]]+$/,""); gsub(/[[:space:]]+/,"\t"); print }
Or this, if the test on $7, $10 and $13 refers to the original field numbers:
$7 ~ /0/ { c++ }
$10 ~ /0/ { c++ }
$13 ~ /0/ { c++ }
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
c > 1 { $8=$9=$10="" }
{ c=0; sub(/[[:space:]]+$/,""); gsub(/[[:space:]]+/,"\t"); print }
If not, edit your question to clarify with a better example.
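If the second part means dropping every base whose count is 0 (an assumption read off the expected output in the question, not something the question states), both steps could be sketched in a single pass. `keep_variants` is a hypothetical name, and the input is assumed to be tab-separated with the base/count/frequency triples starting at columns 5, 8, 11 and 14:

```shell
#!/bin/sh
# Sketch only: keep_variants is a hypothetical helper name, not from the question.
# Assumption: a base/count/frequency triple is dropped when its base equals
# column 3, or when its count is 0.
keep_variants() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        out = $1 OFS $2 OFS $3 OFS $4          # the four leading columns are always kept
        for (i = 5; i + 2 <= NF; i += 3)       # walk the triples at columns 5, 8, 11, 14
            if ($i != $3 && $(i + 1) + 0 > 0)  # not the reference base, and seen at least once
                out = out OFS $i OFS $(i + 1) OFS $(i + 2)
        print out
    }' "$@"
}
```

Called as `keep_variants file`, this leaves 4 columns when only the reference base was seen and 7 when one other base was seen. Values are passed through unchanged, so for position 51169747 it would print the `C 1 0.0666667` from the input rather than the `C 2 0.133333` shown in the expected output.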
Upvotes: 0
Reputation: 242443
Perl solution for the first part:
#!/usr/bin/perl
use warnings;
use strict;
my %remove = ( A => 4,    # Where to start removing the columns
               C => 7,    # for a given character in column #3.
               G => 10,
               T => 13,
             );

$\ = "\n";    # Add newline to prints.
$, = "\t";    # Separate values by tabs.

while (<>) {                            # Read input line by line;
    chomp;                              # Remove newline.
    my @F = split /\t/;                 # Split on tabs, populate an array.
    splice @F, $remove{ $F[2] }, 3;     # Remove the columns.
    print @F;                           # Output.
}
Once you clarify the second requirement, I can try to add more code. What values do you want to remove? Can you show more examples?
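For comparison with the awk answer, the same lookup-table idea could be sketched in awk, which has no `splice` and so rebuilds the line field by field. `drop_ref_columns` is a hypothetical name, the input is assumed to be tab-separated, and the guard for a base missing from the table is an addition that the Perl above does not have:

```shell
#!/bin/sh
# Sketch only: drop_ref_columns is a hypothetical helper name.
drop_ref_columns() {
    awk 'BEGIN {
        FS = OFS = "\t"
        # 1-based column where the triple for each base starts
        start["A"] = 5; start["C"] = 8; start["G"] = 11; start["T"] = 14
    }
    {
        s = ($3 in start) ? start[$3] : 0      # 0 means: unknown base, drop nothing
        out = ""; sep = ""
        for (i = 1; i <= NF; i++) {
            if (s && i >= s && i < s + 3)      # skip the three columns of the base in column 3
                continue
            out = out sep $i
            sep = OFS
        }
        print out
    }' "$@"
}
```

It can be run as `drop_ref_columns file`. Because the line is rebuilt from the kept fields only, no trailing tab is left when the removed columns are the last three.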
Upvotes: 1