Filippo Abbondanza

Reputation: 3

Problems with multipliers using grep

I have the following file

1:10177 rs367896724 A AC
1:10352 rs555500075 T TA
1:10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
1:13284 rs548333521 GT A

The last two columns can contain only the letters [ATCG]. I want to grep all the lines where each of the last two columns holds exactly one letter.

Expected output:

1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C

I've tried the following, but the first two commands gave no results:

grep -F '[ACTG]?\s[ACTG]?$' file | head

grep '[ACTG]?\s[ACTG]?$' file | head

grep -E '.?\s.?$' file

With the last command, I got the following:

1:10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
1:13284 rs548333521 GT A

Thanks for the help!

Upvotes: 0

Views: 42

Answers (2)

Jotne

Reputation: 41460

Something like this?

awk '!(length($NF)>1 || length($(NF-1))>1)' file
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C

This prints only the lines where the last and second-to-last fields are each no longer than one character.

A shorter version:

awk 'length($NF$(NF-1))==2' file
awk 'length($3$4)==2' file

To accept only the letters ACTG:

awk '$NF$(NF-1)~/^[ACTG]{2}$/' file
awk '$3$4~/^[ACTG]{2}$/' file
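The concatenation trick above relies on the `{2}` interval expression, which some older awk implementations may not support by default. A minimal sketch of an alternative that tests each of the last two fields separately is shown here; the temporary file and the variable names `sample` and `snv` are made up for illustration, and the data is a four-line subset of the file in the question.

```shell
# Hypothetical temp file holding a subset of the question's data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1:10177 rs367896724 A AC
1:10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C
1:11012 rs544419019 C G
1:13284 rs548333521 GT A
EOF

# Keep a line only if both the second-to-last and the last field
# consist of exactly one of A, C, T, G.
snv=$(awk '$(NF-1) ~ /^[ACTG]$/ && $NF ~ /^[ACTG]$/' "$sample")

echo "$snv"   # prints: 1:11012 rs544419019 C G

rm -f "$sample"
```

Testing the fields one at a time also makes it easy to apply different rules to each column later.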

Upvotes: 0

Jonathon K

Reputation: 339

If you want exactly one character in each of the last two columns, require a whitespace character in front of each one. From your description it sounds like none of the characters should be optional either, so drop the ? quantifiers.

grep -E '\s.\s.$' file

Or

grep -E '(\s[ACTG]){2}$' file

Either should work.
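For reference, here is a minimal sketch of why the first two attempts in the question print nothing, assuming GNU grep. The temporary file and the variable names (`sample`, `fixed_count`, `basic_count`, `single`) are made up for illustration. The final command uses `[[:space:]]`, the POSIX spelling of the GNU-specific `\s`.

```shell
# Recreate the sample data from the question in a hypothetical temp file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1:10177 rs367896724 A AC
1:10352 rs555500075 T TA
1:10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C
1:11012 rs544419019 C G
1:11063 rs561109771 T G
1:13110 rs540538026 G A
1:13116 rs62635286 T G
1:13118 rs62028691 A G
1:13273 rs531730856 G C
1:13284 rs548333521 GT A
EOF

# -F searches for the pattern as a literal string, which no line contains.
# grep exits non-zero when nothing matches, hence the "|| true".
fixed_count=$(grep -cF '[ACTG]?\s[ACTG]?$' "$sample" || true)

# Without -E the pattern is a basic regex, where ? is a literal question mark.
basic_count=$(grep -c '[ACTG]?\s[ACTG]?$' "$sample" || true)

# Whitespace before each single letter, nothing optional, anchored at the end.
single=$(grep -E '[[:space:]][ACTG][[:space:]][ACTG]$' "$sample")

echo "fixed: $fixed_count, basic: $basic_count"   # prints: fixed: 0, basic: 0
echo "$single"                                    # the six single-letter lines

rm -f "$sample"
```

The third attempt, `grep -E '.?\s.?$'`, does match, but because both characters are optional and nothing is required before the first one, it only guarantees that the last column has at most one character.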

Upvotes: 2
