How to remove rows that match one of several regex patterns?

Question

I have a tab-delimited text file and wish to efficiently remove whole rows that fulfil either of the following criteria:

values in the ALT column that are equal to .
values in the NA00001 column and subsequent columns that have the same digit before and after either of the two delimiters, | or /, e.g. 0|0, 1|1, 2/2.

An example input file is below:

CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 0|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1110696 rs6040360 A      .     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4

Example output file is:

CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4

Michael Wojcik · Accepted Answer

Your example doesn't appear to include any lines that meet the "values in the ALT column that are equal to ." criterion, or lines that don't meet the second criterion (except the header line). So I added some lines of my own to your example for testing; I hope I've understood your criteria.

The first criterion is easily matched by testing the particular field, if we're using something like awk: $5 == "." {next} in an awk script would skip that line. Just using a regular expression is pretty simple too: ^[^^I]*^I[^^I]*^I[^^I]*^I[^^I]*^I\.^I, where ^I is a tab character, matches lines with just "." in the fifth (ALT) field.
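
As a quick check of the field test, here is a minimal sketch. It assumes the file really is tab-delimited and that ALT is the fifth column; the sample rows, the temporary file, and the variable names are made up for illustration.

```shell
# Build a tiny tab-delimited sample (made-up rows; ALT is the fifth column).
sample=$(mktemp)
printf 'CHROM\tPOS\tID\tREF\tALT\n'  > "$sample"
printf '20\t14370\trs1\tG\tA\n'     >> "$sample"
printf '20\t17330\trs2\tT\t.\n'     >> "$sample"

# -F '\t' splits on tabs only; a bare condition prints every row it holds for,
# so this keeps the rows whose ALT field is not ".".
kept=$(awk -F '\t' '$5 != "."' "$sample")
printf '%s\n' "$kept"
rm -f "$sample"
```

Writing the condition as $5 != "." with no action is the inverse of $5 == "." {next} followed by {print}; both forms drop the same rows.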

With strict regular expressions you can't express "the same digit before and after [a delimiter]" directly. You have to do it with alternation of sub-expressions with specific values: 0[|/]0|1[|/]1|2[|/]2... But there are only 10 digits, so this isn't particularly burdensome. So, for example, you can do this filtering with one long egrep command line:

egrep -v '^[^^I]*^I[^^I]*^I[^^I]*^I[^^I]*^I\.^I|0[|/]0|1[|/]1|2[|/]2|3[|/]3|4[|/]4|5[|/]5|6[|/]6|7[|/]7|8[|/]8|9[|/]9' input-file
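
To try that command in a shell, the ^I placeholders have to become real tab characters. The following is a minimal sketch under a few assumptions: it uses grep -E (the POSIX spelling of egrep), shortens the repeated "field, tab" group with {4}, and runs against made-up sample rows in a temporary file.

```shell
# A literal tab for the pattern; ^I in the text stands for this character.
tab=$(printf '\t')

# Made-up sample: a header, one row to keep, one row with ALT ".", one with 1/1.
sample=$(mktemp)
printf 'CHROM\tPOS\tID\tREF\tALT\tNA00001\n'  > "$sample"
printf '20\t100\trs1\tG\tA\t0|1:48\n'        >> "$sample"
printf '20\t200\trs2\tT\t.\t1|2:21\n'        >> "$sample"
printf '20\t300\trs3\tA\tG\t1/1:43\n'        >> "$sample"

# One alternative per digit, since the same digit must appear on both sides.
same='0[|/]0|1[|/]1|2[|/]2|3[|/]3|4[|/]4|5[|/]5|6[|/]6|7[|/]7|8[|/]8|9[|/]9'

# ([^tab]*tab){4} skips the first four fields, then \. followed by a tab
# matches an ALT field that is exactly "."; -v drops every matching line.
filtered=$(grep -Ev "^([^$tab]*$tab){4}\\.$tab|$same" "$sample")
printf '%s\n' "$filtered"
rm -f "$sample"
```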

Obviously that's not something you'd want to type by hand on a regular basis, and isn't ideal for maintenance. A little awk script is better:

#! /usr/bin/awk -f
# Skip lines with "." in the fifth (ALT) field
$5 == "." {next}
# Skip lines with the same digit before and after the delimiter in any field.
# The slash is escaped inside the brackets, since an unescaped / within a
# regex literal is not portable across awk implementations.
/0[|\/]0/ {next}
/1[|\/]1/ {next}
/2[|\/]2/ {next}
/3[|\/]3/ {next}
/4[|\/]4/ {next}
/5[|\/]5/ {next}
/6[|\/]6/ {next}
/7[|\/]7/ {next}
/8[|\/]8/ {next}
/9[|\/]9/ {next}

# Copy all other lines to the output
{print}

I've put the individual digit checks as separate awk statements for readability.
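
The per-digit patterns match anywhere on the line, so a genotype such as 10|0 would also trigger the 0|0 rule. A possible variation, sketched below, compares the two alleles as strings and looks only at the sample columns. It rests on several assumptions that are not in the original answer: the file is tab-delimited, the samples start at column 10, and the genotype is the first colon-separated subfield of each sample. It keeps the reading used above, where one matching sample is enough to drop the row. The sample rows are made up.

```shell
sample=$(mktemp)
printf 'CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\n' > "$sample"
printf '20\t100\trs1\tG\tA\t29\tPASS\tDP=14\tGT:GQ\t0|1:48\t1|2:21\n'           >> "$sample"
printf '20\t200\trs2\tT\t.\t3\tq10\tDP=11\tGT:GQ\t0|1:49\t2|1:3\n'              >> "$sample"
printf '20\t300\trs3\tA\tG\t67\tPASS\tDP=10\tGT:GQ\t1|2:21\t2/2:35\n'           >> "$sample"

result=$(awk -F '\t' '
    # Skip rows with "." in the ALT field
    $5 == "." { next }
    {
        # Check each sample column (assumed to start at field 10)
        for (i = 10; i <= NF; i++) {
            split($i, part, ":")                  # part[1] is the genotype
            n = split(part[1], allele, "[|/]")    # split on either delimiter
            if (n == 2 && allele[1] == allele[2]) next
        }
        print
    }
' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The header row survives because NA00001 contains no delimiter, so split returns a single piece and the comparison is never made.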

With regular-expression dialects that support back-references, you can express "same character before and after the delimiter" directly. (POSIX extended regular expressions, EREs, do not define back-references, although some implementations accept them as an extension.) Back-references should be used with caution, since they can create pathological performance characteristics; and, of course, you'll have to use a language that supports them, such as Perl. POSIX awk and GNU awk (gawk) don't. Here's a Perl one-liner that handles the second criterion:

LINE: while (<>) { next LINE if /(\d)[|\/]\g1/; print }

That's probably not very good Perl - I almost never use the language - but it works in my testing. The (\d) matches and remembers the digit before the delimiter, and the \g1 matches the remembered digit after the delimiter.
