Reputation: 3
I have the following file
1:10177 rs367896724 A AC
1:10352 rs555500075 T TA
1:10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
1:13284 rs548333521 GT A
The last two columns can only contain the letters [ATCG]. I want to grep all the lines where each of the last two columns holds exactly one letter.
Expected output:
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
I've tried the following, but the first two commands gave no results:
grep -F '[ACTG]?\s[ACTG]?$' file | head
grep '[ACTG]?\s[ACTG]?$' file | head
grep -E '.?\s.?$' file
With the last command, I got the following:
1:10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
1:13284 rs548333521 GT A
Thanks for the help!
Upvotes: 0
Views: 42
Reputation: 41460
Something like this?
awk '!(length($NF)>1 || length($(NF-1))>1)' file
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
This prints only the lines where neither the last nor the second-to-last field is longer than one character.
A shorter version:
awk 'length($NF$(NF-1))==2' file
awk 'length($3$4)==2' file
To accept only the letters A, C, T and G:
awk '$NF$(NF-1)~/^[ACTG]{2}$/' file
awk '$3$4~/^[ACTG]{2}$/' file
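If your awk does not support the `{2}` interval expression (some older implementations do not), each field can be tested on its own instead. This is only a sketch: the temporary file and the `sample` and `result` variable names are made up for the demonstration, and the rows are copied from the question.

```shell
# Hypothetical sample file holding the rows from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1:10177 rs367896724 A AC
1:10352 rs555500075 T TA
1:10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
1:13284 rs548333521 GT A
EOF

# Keep a row only if the second-to-last AND the last field are each
# exactly one of A, C, G, T. No interval expression is needed.
result=$(awk '$(NF-1) ~ /^[ACGT]$/ && $NF ~ /^[ACGT]$/' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Testing the fields separately also makes it easy to apply a different rule to each column later on.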
Upvotes: 0
Reputation: 339
If you want exactly one character in each of the last two columns, require a whitespace character in front of each one. From your description it sounds like nothing should be optional either, so drop the `?` quantifiers.
grep -E '\s.\s.$' file
Or
grep -E '(\s[ACTG]){2}$' file
Either should work.
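To see the difference the `?` makes, here is a small sketch that counts matching lines for an optional pattern and for a strict one. It assumes a throwaway temporary file with the rows from the question; the `sample`, `loose` and `strict` names are made up for the demonstration. It uses `[[:space:]]` instead of `\s`, since `\s` is a GNU extension and may not be available in every grep. As for the original attempts: `-F` treats the pattern as a literal string, and without `-E` the `?` is an ordinary character, which is likely why those two found nothing.

```shell
# Hypothetical sample file holding the rows from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1:10177 rs367896724 A AC
1:10352 rs555500075 T TA
1:10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
1:13284 rs548333521 GT A
EOF

# Optional letters: any line ending in whitespace plus at most one
# letter matches, so the long alleles in column 3 slip through.
loose=$(grep -cE '[ACGT]?[[:space:]][ACGT]?$' "$sample")

# Strict: whitespace + one letter, twice, anchored at end of line.
strict=$(grep -cE '([[:space:]][ACGT]){2}$' "$sample")

echo "optional pattern: $loose lines, strict pattern: $strict lines"
# prints: optional pattern: 8 lines, strict pattern: 6 lines

rm -f "$sample"
```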
Upvotes: 2