Reputation: 307
I have the following input:
ASR cAND text1 (p.Pro221Leu)
GMPPB cAND text2 c.1069G>A (p.Val357Ile)
KLHL40 cAND text3
GMPPB cAND text4 c.220C>T (p.Arg74Ter)
I want to print the text between the word cAND and either (p or c.
Note: text3 is not expected, since its line fails to satisfy the above condition.
Expected output:
text1
text2
text4
Regex used:
grep "cAND.+(c\.|\(p)"
However, I am not getting the expected output. Please tell me what is wrong with my regex. Thanks.
Upvotes: 1
Views: 55
Reputation: 41838
With grep in Perl mode, you can do this (see demo):
grep -oP "cAND[ ]*\K\S+(?=[ ]*(?:c\.|\(p))" some_path_or_files
How does it work? Greed.
-o prints only the matched part of each line rather than the whole line.
cAND[ ]* ensures we have the cAND and also matches the following spaces.
\K discards what we have matched so far, so that we can return clean strings such as text1.
\S+ matches the characters we want: any non-space characters.
(?=[ ]*(?:c\.|\(p)) is a lookahead that ensures what follows is optional spaces and then the c. or (p delimiter.
What was wrong?
.+ in your cAND.+(c\.|\(p) is "greedy": it eats up all the characters until the end of the string, then backtracks until the (c\.|\(p) can be matched. Therefore, it eats characters up to the last c. or (p, for instance: cAND text2 c.1069G>A (p rather than a clean string such as text1, as you wanted.
Alternate Regex with Lookarounds
Since you're studying regex... This also works, although the match keeps the spaces around the text:
(?<=cAND).*?(?=c\.|\(p)
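The greedy behaviour and the lookaround alternative can be compared directly on the question's sample lines. This is only a minimal sketch: it assumes a GNU grep built with PCRE support for -P, and the spaces inside the lookarounds are an addition here (not part of the regex above) so that the output carries no surrounding blanks.

```shell
# Sample input copied from the question.
input='ASR cAND text1 (p.Pro221Leu)
GMPPB cAND text2 c.1069G>A (p.Val357Ile)
KLHL40 cAND text3
GMPPB cAND text4 c.220C>T (p.Arg74Ter)'

# Greedy pattern from the question. -E makes +, | and ( ) act as operators,
# and -o prints only the matched part, so the over-long match is visible.
greedy=$(printf '%s\n' "$input" | grep -oE 'cAND.+(c\.|\(p)')

# Lazy lookaround variant (requires -P). The spaces inside the lookbehind
# and lookahead are an assumption added to trim the result.
lazy=$(printf '%s\n' "$input" | grep -oP '(?<=cAND ).*?(?= c\.| \(p)')

printf 'greedy:\n%s\n\nlazy:\n%s\n' "$greedy" "$lazy"
```

On the second sample line the greedy pattern matches cAND text2 c.1069G>A (p, while the lazy variant stops at text2. The text3 line produces no match in either case.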
Reference
The Many Degrees of Regex Greed
Upvotes: 4
Reputation: 174696
And here is one using awk:
$ awk '$2=="cAND" && $4~/^c|^\(p/ { print $3}' file
text1
text2
text4
This checks that column 2 is cAND and that column 4 starts with c or (p. If both conditions are satisfied, column 3 of that line is printed.
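The command above relies on cAND always being the second column. If that cannot be guaranteed, the same two checks can be applied at whatever position cAND appears. The following is a hedged sketch of that idea rather than part of the original answer: the loop and the stricter c\. test are assumptions introduced here.

```shell
# Sample input copied from the question.
input='ASR cAND text1 (p.Pro221Leu)
GMPPB cAND text2 c.1069G>A (p.Val357Ile)
KLHL40 cAND text3
GMPPB cAND text4 c.220C>T (p.Arg74Ter)'

# Scan every field for cAND; print the next field only when the field
# after that starts with "c." or "(p".
result=$(printf '%s\n' "$input" | awk '
  {
    for (i = 1; i + 2 <= NF; i++)        # two more fields must follow cAND
      if ($i == "cAND" && $(i + 2) ~ /^(c\.|\(p)/) {
        print $(i + 1)
        break                            # at most one result per line
      }
  }')

printf '%s\n' "$result"
```

The text3 line is skipped because no field follows text3, so the loop never reaches a position where both conditions hold.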
Upvotes: 0
Reputation: 784968
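Using sed -r:
sed -nr 's/^.*cAND ([^ ]+) (c\.|\(p).*$/\1/p' file
text1
text2
text4
The -n option together with the p flag prints only the lines where the substitution succeeded, so the text3 line is skipped.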
PS: Use sed -E on OSX.
Upvotes: 1