slien

Reputation: 1

Delete specific columns when another column has a specific value (Perl or awk)

I have a file with 16 different columns (tab-separated values):

22    51169729    G   39  A   0   0   C   0   0   G   38  0.974359    T   1   0.025641
22    51169730    A   36  A   36  1   C   0   0   G   0   0   T   0   0
22    51169731    C   39  A   0   0   C   39  1   G   0   0   T   0   0
22    51169732    G   37  A   0   0   C   0   0   G   37  1   T   0   0
22    51169733    G   33  A   0   0   C   0   0   G   33  1   T   0   0
22    51169734    C   35  A   0   0   C   35  1   G   0   0   T   0   0
22    51169735    A   32  A   32  1   C   0   0   G   0   0   T   0   0
22    51169736    G   32  A   0   0   C   0   0   G   32  1   T   0   0
22    51169737    C   30  A   0   0   C   30  1   G   0   0   T   0   0
22    51169738    T   27  A   0   0   C   0   0   G   0   0   T   27  1
22    51169739    G   26  A   0   0   C   0   0   G   26  1   T   0   0
22    51169740    A   25  A   25  1   C   0   0   G   0   0   T   0   0
22    51169741    C   22  A   0   0   C   22  1   G   0   0   T   0   0
22    51169742    G   23  A   0   0   C   0   0   G   23  1   T   0   0
22    51169743    C   21  A   0   0   C   21  1   G   0   0   T   0   0
22    51169744    C   22  A   0   0   C   22  1   G   0   0   T   0   0
22    51169745    C   19  A   0   0   C   19  1   G   0   0   T   0   0
22    51169746    C   19  A   0   0   C   19  1   G   0   0   T   0   0
22    51169747    A   15  A   14  0.933333    C   1   0.0666667   G   0   0   T   0   0
22    51169748    C   20  A   0   0   C   20  1   G   0   0   T   0   0

The third column can be A, G, C or T.

I would like to:

Once this is done for the entire file, some rows would have only 4 columns left and others 7, as in the following example:

22    51169729    G   39  T   1   0.025641
22    51169730    A   36  
22    51169731    C   39  
22    51169732    G   37  
22    51169733    G   33  
22    51169734    C   35  
22    51169735    A   32  
22    51169736    G   32  
22    51169737    C   30  
22    51169738    T   27  
22    51169739    G   26  
22    51169740    A   25  
22    51169741    C   22  
22    51169742    G   23  
22    51169743    C   21  
22    51169744    C   22  
22    51169745    C   19  
22    51169746    C   19  
22    51169747    A   15  C   2   0.133333    
22    51169748    C   20  

Any suggestions?

Upvotes: 0

Views: 114

Answers (2)

Ed Morton

Reputation: 204731

Here's one way to do the first part, assuming no empty fields:

$ cat tst.awk
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
{ gsub(/[[:space:]]+/,"\t"); sub(/\t$/,""); print }

$ awk -f tst.awk file
1       957584  C       157     A       1       0.006   G       0       0       T       0       0
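
An alternative that never blanks fields is to rebuild each line from the groups you want to keep. A minimal sketch, assuming the input is genuinely tab-separated and every row has the four leading columns followed by (base, count, fraction) triples; the function name drop_ref_group is made up for illustration:

```shell
# Hypothetical helper: drop the three-column group whose base letter equals column 3.
# Assumes tab-separated input; reads the named files, or stdin if none are given.
drop_ref_group() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        out = $1 OFS $2 OFS $3 OFS $4       # always keep the first four columns
        for (i = 5; i + 2 <= NF; i += 3)     # walk the (base, count, fraction) groups
            if ($i != $3)                    # keep only the groups for the other bases
                out = out OFS $i OFS $(i+1) OFS $(i+2)
        print out
    }' "$@"
}
```

Run it as "drop_ref_group file", or pipe data into it.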

I don't really understand what you're trying to do in the second part, but this might be what you want if the tests on $7, $10 and $13 refer to the field numbers as they are after the first phase:

$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
{ $0=$0 }
$7  ~ /0/ { c++ }
$10 ~ /0/ { c++ }
$13 ~ /0/ { c++ }
c > 1 { $8=$9=$10="" }
{ c=0; gsub(/[[:space:]]+/,"\t"); sub(/\t$/,""); print }

or this, if the tests on $7, $10 and $13 refer to the original field numbers:

$7  ~ /0/ { c++ }
$10 ~ /0/ { c++ }
$13 ~ /0/ { c++ }
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
c > 1 { $8=$9=$10="" }
{ c=0; gsub(/[[:space:]]+/,"\t"); sub(/\t$/,""); print }
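
A third possibility, going only by the expected output in the question: besides the group matching $3, every group whose count is 0 is dropped as well. This is a guess at the intent (the expected row for 51169747 does not match its input values, so the example cannot confirm it), and the function name drop_ref_and_zero_groups is made up for illustration. It assumes tab-separated input:

```shell
# Hypothetical helper: drop the group matching column 3 AND any group with a zero count.
# This rule is an assumption inferred from the expected output, not a confirmed requirement.
drop_ref_and_zero_groups() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        out = $1 OFS $2 OFS $3 OFS $4               # always keep the first four columns
        for (i = 5; i + 2 <= NF; i += 3)             # walk the (base, count, fraction) groups
            if ($i != $3 && $(i+1) + 0 != 0)         # skip the reference base and zero counts
                out = out OFS $i OFS $(i+1) OFS $(i+2)
        print out
    }' "$@"
}
```

On the first sample row this keeps only the T group, leaving 7 columns; rows where every other base has a zero count are reduced to 4 columns.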

If not, edit your question to clarify with a better example.

Upvotes: 0

choroba

Reputation: 242443

Perl solution for the first part:

#!/usr/bin/perl
use warnings;
use strict;

my %remove = ( A => 4,                # Zero-based index of the first column
               C => 7,                # to remove for a given character in
               G => 10,               # column #3 (e.g. A removes columns 5-7).
               T => 13,
             );

$\ = "\n";                            # Add newline to prints.
$, = "\t";                            # Separate values by tabs.

while (<>) {                          # Read input line by line;
    chomp;                            # Remove newline.
    my @F = split /\t/;               # Split on tabs, populate an array.
    splice @F, $remove{ $F[2] }, 3    # Remove the columns, leaving rows with
        if exists $remove{ $F[2] };   # an unexpected letter untouched.
    print @F;                         # Output.
}

Once you clarify the second requirement, I can try to add more code. What values do you want to remove? Can you show more examples?

Upvotes: 1
