Reputation: 111
I have a quite large text file with genetic data (94,807,000 rows). I want to extract the rows in which specific patterns occur in a specific column. I tried using awk and grep in various ways but did not find a way to get the job done. The file is space-delimited and looks like this:
V1 V2 V3 V4 V5 V6
1: 10 179406 T . HPGM T,T,T,T
2: 10 179407 T . HPGM T,T,T,T
3: 10 179408 G . HPGM G,G,G,G
4: 10 179409 A . HPGM A,A,A,A
5: 10 179410 A . HPGM A,A,A,A
6: 10 179411 T . HPGM T,T,T,T
V5 and V6 can have more than the four entries shown here, and the contents can look pretty weird, like:
V1 V2 V3 V4 V5 V6
1: 1 158154514 A . HPGO A,AAAA..204..TTTT,A,A
I want to keep the lines where both entries for H and P (those are the first two comma-delimited entries in V6) are exactly one of A, C, T or G, so each should consist of only one of those four characters. H and P do not have to have the same character, though. In V5 multiple combinations can occur, but all start with HP. I am not interested in whether or how many entries come afterwards, and all rows have entries for H and P, so I do not have to deal with missing entries.
I found some answers that show how to search for multiple patterns using logical or ||, some that show how to look in a specific field using $6 ~ '/A,.' and how to look for exact matches using == "pattern". However, I did not find answers that combine these things and could not figure it out by myself. Help would be very much appreciated.
Upvotes: 0
Views: 355
Reputation: 785541
You can use this awk command:
awk 'split($NF, a, /,/) && a[1] a[2] ~ /^[ACTG]{2}$/' file
1: 10 179406 T . HPGM T,T,T,T
2: 10 179407 T . HPGM T,T,T,T
3: 10 179408 G . HPGM G,G,G,G
4: 10 179409 A . HPGM A,A,A,A
5: 10 179410 A . HPGM A,A,A,A
6: 10 179411 T . HPGM T,T,T,T
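split($NF, a, /,/) splits the last column on commas and stores the pieces in the array a.
a[1] a[2] ~ /^[ACTG]{2}$/ concatenates the first and second sub-fields and uses a regex to check that the result is exactly two characters, each one of A, C, T or G. Since every row has an entry for both H and P, this means each of the two entries is a single base.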
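As a variation, the two entries can also be tested separately instead of being concatenated. The following is only a minimal sketch: the function name filter_hp and the temporary sample file are made up for illustration, and the sample rows are copied from the question. Testing each entry on its own avoids the {2} interval expression, which some older awk builds may not support, and it does not rely on the assumption that neither entry is empty.

```shell
# Hypothetical sample file; rows are copied from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
V1 V2 V3 V4 V5 V6
1: 10 179406 T . HPGM T,T,T,T
3: 10 179408 G . HPGM G,G,G,G
1: 1 158154514 A . HPGO A,AAAA..204..TTTT,A,A
EOF

# filter_hp (illustrative name): print a row only if the last field has at
# least two comma-separated entries and the first two are each exactly one
# of A, C, G or T. Reads the given files, or stdin if none are given.
filter_hp() {
  awk 'split($NF, a, ",") >= 2 && a[1] ~ /^[ACGT]$/ && a[2] ~ /^[ACGT]$/' "$@"
}

filter_hp "$sample"
```

On the sample above this should print only the two rows whose last field starts with T,T and G,G. The header line and the row containing AAAA..204..TTTT are dropped. For the real data, call filter_hp file (or pipe the file into it) and redirect the output to a new file.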
Upvotes: 1