Extract lines with multiple patterns occurring in one column using awk

Question

I have a quite large text file with genetic data (94,807,000 rows). I want to extract the rows in which specific patterns occur in a specific column. I tried using awk and grep in various ways but did not find a way to get the job done. The file is space-delimited and looks like this:

   V1     V2 V3 V4   V5      V6
1: 10 179406  T  . HPGM T,T,T,T
2: 10 179407  T  . HPGM T,T,T,T
3: 10 179408  G  . HPGM G,G,G,G
4: 10 179409  A  . HPGM A,A,A,A
5: 10 179410  A  . HPGM A,A,A,A
6: 10 179411  T  . HPGM T,T,T,T

V5 and V6 can have more than the four entries shown here, and the entries can look pretty weird, like:

   V1        V2 V3 V4   V5                    V6
1:  1 158154514  A  . HPGO A,AAAA..204..TTTT,A,A

I want to keep the lines where the entries for H and P (the first two comma-delimited entries in V6) are each exactly one of A, C, T or G, i.e. a single character from that set. H and P do not have to be the same character, though. Multiple combinations can occur in V5, but all of them start with HP. I do not care whether, or how many, entries come afterwards. All rows have entries for H and P, so I do not have to deal with missing entries.

I found some answers that show how to search for multiple patterns using logical or (||), some that show how to look in a specific field using $6 ~ '/A,.', and some that show how to look for exact matches using == "pattern". However, I did not find answers that combine these things, and I could not figure it out by myself. Help would be very much appreciated.

anubhava · Accepted Answer

You can use this awk command:

awk 'split($NF, a, /,/) && a[1] a[2] ~ /^[ACTG]{2}$/' file

1: 10 179406  T  . HPGM T,T,T,T
2: 10 179407  T  . HPGM T,T,T,T
3: 10 179408  G  . HPGM G,G,G,G
4: 10 179409  A  . HPGM A,A,A,A
5: 10 179410  A  . HPGM A,A,A,A
6: 10 179411  T  . HPGM T,T,T,T

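split($NF, a, /,/) splits the last column on commas into the array a and returns the number of pieces, so it is true for every non-empty last column.
a[1] a[2] ~ /^[ACTG]{2}$/ concatenates the first two sub-fields and uses a regex to ensure the result is exactly two characters, each one of A, C, T or G.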
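If your awk does not support the {2} interval expression (some older implementations do not), a possible variant is to match the last column directly, without split. This is only a sketch: the temporary sample file is made up for illustration, with rows 1 and 2 taken from the question and rows 3 and 4 being hypothetical. The regex requires one base, a comma, one base, and then either another comma or the end of the field.

```shell
# Build a small sample in the question's layout.
# Rows 3 and 4 are hypothetical and only there to exercise the edge cases.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1: 10 179406  T  . HPGM T,T,T,T
2:  1 158154514  A  . HPGO A,AAAA..204..TTTT,A,A
3: 10 179410  A  . HP AA,
4: 10 179411  C  . HP C,G
EOF

# Keep rows whose last column starts with two single-base entries:
# one of A/C/G/T, a comma, one of A/C/G/T, then a comma or end of field.
result=$(awk '$NF ~ /^[ACGT],[ACGT](,|$)/' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

On the real data this would be awk '$NF ~ /^[ACGT],[ACGT](,|$)/' file. Unlike the concatenation test, it checks each entry separately, so a hypothetical entry such as AA followed by an empty P entry (row 3) is rejected. The question says such rows do not occur, so both commands should select the same lines there.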
