dhvlnyk

Reputation: 307

Regex query to print required output

I have the following input

ASR cAND text1 (p.Pro221Leu)
GMPPB cAND text2 c.1069G>A (p.Val357Ile)
KLHL40 cAND text3
GMPPB cAND text4 c.220C>T (p.Arg74Ter)

I want to print the text between the word cAND and either (p or c.
Note: text3 is not expected, since nothing after it satisfies this condition.

Expected output:
text1
text2
text4    

regex used
grep "cAND.+(c\.|\(p)" 

However, I am not getting the expected output. What is wrong with my regex? Thanks.

Upvotes: 1

Views: 55

Answers (3)

zx81

Reputation: 41838

With grep in Perl mode, you can do this (see demo):

grep -oP "cAND[ ]*\K\S+(?=[ ]*(?:c\.|\(p))" some_path_or_files

How does it work? Greed.

  • The -o option makes grep print only the matched text instead of the whole line
  • The cAND[ ]* ensures we have the cAND and also matches the following spaces
  • The \K discards what we have matched so far, so that we can return clean strings such as text1
  • The \S+ matches the characters we want: one or more non-space characters
  • The (?=[ ]*(?:c\.|\(p)) lookahead ensures that what follows is optional spaces and then the c. or (p delimiter
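To see what \K contributes, here is a small sketch that runs the pattern with and without it on one of the sample lines. It assumes GNU grep built with PCRE support (the -P option); the variable names are only for illustration, and the dot in c\. is escaped so that it matches a literal dot.

```shell
line='GMPPB cAND text2 c.1069G>A (p.Val357Ile)'

# Without \K, the reported match still includes cAND and the space after it
with_prefix=$(printf '%s\n' "$line" | grep -oP 'cAND[ ]*\S+(?=[ ]*(?:c\.|\(p))')

# With \K, everything matched before it is dropped from the reported match
without_prefix=$(printf '%s\n' "$line" | grep -oP 'cAND[ ]*\K\S+(?=[ ]*(?:c\.|\(p))')

printf '%s\n' "$with_prefix"      # cAND text2
printf '%s\n' "$without_prefix"   # text2
```

The lookahead never becomes part of the match, which is why the c. delimiter does not show up in either result.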

What was wrong?

  1. The .+ in your cAND.+(c\.|\(p) is "greedy": it eats up all the characters to the end of the line, then backtracks until the (c\.|\(p) can match. Therefore, it consumes everything up to the last c. or (p, for instance: cAND text2 c.1069G>A (p
  2. The match therefore covered cAND and the delimiter as well, not just text1 as you wanted.
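The greedy behaviour is easy to reproduce. This sketch feeds the sample input from the question to the original pattern, using -E so that +, | and the parentheses act as operators, and -o so that only the matched part is printed. The variable names are only for illustration, and it assumes a grep that supports -o (GNU and BSD grep both do).

```shell
input='ASR cAND text1 (p.Pro221Leu)
GMPPB cAND text2 c.1069G>A (p.Val357Ile)
KLHL40 cAND text3
GMPPB cAND text4 c.220C>T (p.Arg74Ter)'

# Each match runs from cAND up to the last "c." or "(p" on the line
greedy=$(printf '%s\n' "$input" | grep -oE 'cAND.+(c\.|\(p)')

printf '%s\n' "$greedy"
# cAND text1 (p
# cAND text2 c.1069G>A (p
# cAND text4 c.220C>T (p
```

The text3 line is skipped because nothing after cAND matches either delimiter.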

Alternate Regex with Lookarounds

Since you're studying regex... This also works (in Perl mode).

(?<=cAND).*?(?=c\.|\(p)
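A quick way to check the lookaround version is to print the match between brackets. The sketch below again assumes GNU grep with -P, escapes the dot in c\., and uses an illustrative variable name.

```shell
line='GMPPB cAND text2 c.1069G>A (p.Val357Ile)'

# Lookbehind and lookahead are zero-width, so only the text between them is reported
between=$(printf '%s\n' "$line" | grep -oP '(?<=cAND).*?(?=c\.|\(p)')

# Brackets make the surrounding spaces visible
printf '[%s]\n' "$between"   # [ text2 ]
```

Note that the lazy .*? keeps the spaces on both sides of text2, so this variant needs a trim step if you want the bare token.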

Reference

The Many Degrees of Regex Greed

Upvotes: 4

Avinash Raj

Reputation: 174696

And here is one using awk:

$ awk '$2=="cAND" && $4~/^c|^\(p/ { print $3}' file
text1
text2
text4

This checks that column 2 is cAND and that column 4 starts with c or (p. If both conditions hold, column 3 of that line is printed.

Upvotes: 0

anubhava

Reputation: 784968

Using sed -r:

sed -nr 's/^.*cAND ([^ ]+) (c\.|\(p).*$/\1/p' file
text1
text2
text4

PS: Use sed -E on OSX.

Upvotes: 1
