Reputation: 127
I have a tab-delimited text file and wish to efficiently remove whole rows that fulfil either of the following criteria:
- values in the ALT column that are equal to .
- values in the NA00001 column and subsequent columns that have the same digit before and after either of the two delimiters, | or /, e.g. 0|0, 1|1, 2/2, etc.
An example input file is below:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 0|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1110696 rs6040360 A . 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
Example output file is:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
Upvotes: 0
Views: 206
Reputation: 5347
perl -alnE '$F[4] eq "." and
$F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
say'
Update: Sorry, the OP asked for the opposite:
perl -alnE 'say unless (
$F[4] eq "." or
( $F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
$F[11] =~ m!(\d)[|/]\1!
)
)'
or, equivalently:
perl -ane 'next if ( $F[4] eq ".");
next if ( $F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
$F[11] =~ m!(\d)[|/]\1! );
print '
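For comparison, here is a minimal Python sketch of the same logic. It assumes, going by the example output in the question, that a row is dropped only when every sample column is homozygous, and unlike the one-liners above it covers however many sample columns follow FORMAT. The names `keep_row`, `HOMOZYGOUS` and `FIRST_SAMPLE` are illustrative, not from the thread.

```python
import re

# Genotype at the start of a sample field: the same allele number on both
# sides of "|" or "/", e.g. 0|0 or 2/2. Requiring ":" or end-of-field after
# it avoids treating 1|10 as homozygous.
HOMOZYGOUS = re.compile(r"(\d+)[|/]\1(?::|$)")

FIRST_SAMPLE = 9  # assumption: NA00001 is the 10th tab-separated column


def keep_row(line: str) -> bool:
    """Return True if the row should be written to the output."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= FIRST_SAMPLE:
        return True  # no sample columns to judge; keep the line unchanged
    if fields[4] == ".":  # ALT column
        return False
    samples = fields[FIRST_SAMPLE:]
    # Drop the row only when every sample column is homozygous.
    return not all(HOMOZYGOUS.match(s) for s in samples)
```

To use it as a filter, loop over `sys.stdin` and write back each line for which `keep_row(line)` is true. The header line passes through because its sample columns hold names rather than genotypes.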
Upvotes: 1
Reputation: 158
Your example doesn't appear to include any lines that meet the "values in the ALT column that are equal to ." criterion, or lines that don't meet the second criterion (except the header line). So I added some lines of my own to your example for testing; I hope I've understood your criteria.
The first criterion is easily matched by testing the particular field, if we're using something like awk: $5 == "." {next} in an awk script would skip that line. Just using a regular expression is pretty simple too: ^[^^I]*^I[^^I]*^I[^^I]*^I[^^I]*^I\.^I, where ^I is a tab character, matches lines with just "." in the fifth (ALT) field.
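The two approaches to the first criterion can be put side by side in a small Python sketch. The function names are made up for illustration, and the regex is a compacted form of the pattern above that also accepts "." as the last field on the line, which the original pattern (with its trailing tab) does not.

```python
import re

# Four tab-terminated fields, then a fifth field that is exactly ".".
ALT_IS_DOT = re.compile(r"^(?:[^\t]*\t){4}\.(?:\t|$)")


def alt_is_dot_by_field(line: str) -> bool:
    """Split on tabs and test the fifth field, like $5 == "." in awk."""
    fields = line.rstrip("\n").split("\t")
    return len(fields) > 4 and fields[4] == "."


def alt_is_dot_by_regex(line: str) -> bool:
    """Same test expressed as one regular expression over the whole line."""
    return ALT_IS_DOT.search(line.rstrip("\n")) is not None
```

The field test is easier to read and to adapt if the column position changes; the regex form is mainly useful when only a grep-style tool is available.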
With strict regular expressions you can't express "the same digit before and after [a delimiter]" directly. You have to do it with alternation of sub-expressions with specific values: 0[|/]0|1[|/]1|2[|/]2 and so on. But there are only 10 digits, so this isn't particularly burdensome. So, for example, you can do this filtering with one long egrep command line:
egrep -v '^[^^I]*^I[^^I]*^I[^^I]*^I[^^I]*^I\.^I|0[|/]0|1[|/]1|2[|/]2|3[|/]3|4[|/]4|5[|/]5|6[|/]6|7[|/]7|8[|/]8|9[|/]9' input-file
Obviously that's not something you'd want to type by hand on a regular basis, and isn't ideal for maintenance. A little awk script is better:
#! /usr/bin/awk -f
# Skip lines with "." in the fifth (ALT) field
$5 == "." {next}
# Skip lines with the same digit before and after the delimiter in any field
/0[|/]0/ {next}
/1[|/]1/ {next}
/2[|/]2/ {next}
/3[|/]3/ {next}
/4[|/]4/ {next}
/5[|/]5/ {next}
/6[|/]6/ {next}
/7[|/]7/ {next}
/8[|/]8/ {next}
/9[|/]9/ {next}
# Copy all other lines to the output
{print}
I've put the individual digit checks as separate awk statements for readability.
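Since the ten per-digit rules all follow one template, they can also be generated rather than typed. Here is a rough Python sketch of that idea, with `skip_line` as a made-up name; it keeps the script's behaviour of skipping a line when the pattern appears anywhere on it.

```python
import re

# Build 0[|/]0|1[|/]1|...|9[|/]9 from a template instead of typing it out.
SAME_DIGIT = re.compile("|".join(f"{d}[|/]{d}" for d in range(10)))


def skip_line(line: str) -> bool:
    """True if the line would be skipped by the rules in the script above."""
    fields = line.rstrip("\n").split("\t")
    alt_is_dot = len(fields) > 4 and fields[4] == "."
    return alt_is_dot or SAME_DIGIT.search(line) is not None
```

Note that matching anywhere on the line is stricter than the question's example output, which keeps rows where only some of the sample columns are homozygous.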
With a regular-expression dialect that supports back-references, you can express "same character before and after the delimiter" directly. (POSIX extended regular expressions do not define back-references, although some implementations accept them as an extension.) Back-references should be used with caution, since they can create pathological performance characteristics; and, of course, you'll have to use a language that supports them, such as Perl. POSIX awk and GNU gawk don't. Here's a Perl one-liner that handles the second criterion:
LINE: while (<STDIN>) { next LINE if /(\d)[|\/]\g1/; print }
That's probably not very good Perl - I almost never use the language - but it works in my testing. The (\d) matches and remembers the digit before the delimiter, and the \g1 matches the remembered digit after the delimiter.
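The same back-reference idea can be tried out in Python, whose `re` module spells the back-reference `\1` inside a pattern. This is only a sketch, and `has_same_digit_pair` is a made-up name.

```python
import re

# (\d) captures one digit; \1 must then match that same digit again.
SAME_DIGIT = re.compile(r"(\d)[|/]\1")


def has_same_digit_pair(text: str) -> bool:
    """True if some digit is repeated on both sides of "|" or "/"."""
    return SAME_DIGIT.search(text) is not None
```

One caveat: because the pattern captures a single digit and is not anchored, a multi-digit value such as 1|10 also matches. If allele numbers above 9 can occur, capturing `(\d+)` and anchoring the match to the start of the field and to the following `:` would be safer.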
Upvotes: 2