Reputation: 64054
What single regex would let me capture all the text that comes after "are genes" and "is gene" in this text?
The closest human genes of best are genes A B C
The closest human gene of best is gene A
So I hope to end up with $1 containing, respectively:
A B C
A
I tried this, but it fails:
$line =~ /The closest .* gene[s] (.*)$/;
Upvotes: 1
Views: 4409
Reputation: 29854
I think the most explicit is:
$line =~ m/best \s (?:is \s gene|are \s genes) \s (\p{IsUpper}(?: \s \p{IsUpper})*)/x;
Of course, if you know that all sentences are going to be grammatical, then you can do the (?:are|is) thing. And if you know that you're only going to have genes A-N or something, you can forget the \p{IsUpper} and use [A-N].
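For what it's worth, here is a minimal sketch of trying a pattern of that shape against the two sample lines from the question. The extract_genes helper name is made up for illustration, and the sketch assumes gene names are single uppercase letters separated by single whitespace characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the gene list
# that follows "best is gene" / "best are genes", or undef if nothing matches.
sub extract_genes {
    my ($line) = @_;
    # Under /x, whitespace in the pattern is ignored, so each real space is spelled \s.
    if ($line =~ m/best \s (?:is \s gene|are \s genes) \s (\p{IsUpper}(?: \s \p{IsUpper})*)/x) {
        return $1;
    }
    return;
}

for my $line (
    'The closest human genes of best are genes A B C',
    'The closest human gene of best is gene A',
) {
    my $genes = extract_genes($line);
    print defined $genes ? "$genes\n" : "no match\n";
}
```

Because the alternation pairs "is" with "gene" and "are" with "genes", an ungrammatical line such as "best is genes A" would not match here, which is the trade-off against the looser (?:are|is) form.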
Upvotes: 3
Reputation: 137727
Use a non-greedy quantifier at the beginning to reduce the opportunities for surprises. Use non-capturing parens to group alternatives that you don't care about. Append ? to a letter to make it optional. Hence, try this:
$line =~ /The closest .*? (?:is|are) genes? (.*)$/;
To see where you were going wrong BTW, just compare the above with what you were originally trying.
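To make that comparison concrete, here is a rough sketch that runs both patterns over the two sample lines. The capture helper and the labels are illustrative names only. The point to watch is that [s] is a character class matching exactly one "s", so it is still required, whereas s? makes the letter optional.

```perl
use strict;
use warnings;

my @lines = (
    'The closest human genes of best are genes A B C',
    'The closest human gene of best is gene A',
);

# Labels are illustrative; the patterns are the two quoted in this thread.
my @patterns = (
    [ 'original attempt', qr/The closest .* gene[s] (.*)$/ ],
    [ 'suggested fix',    qr/The closest .*? (?:is|are) genes? (.*)$/ ],
);

# Hypothetical helper: returns the first capture group, or undef on no match.
sub capture {
    my ($re, $line) = @_;
    return $line =~ $re ? $1 : undef;
}

for my $p (@patterns) {
    my ($label, $re) = @$p;
    for my $line (@lines) {
        my $got = capture($re, $line);
        printf "%-16s | %-48s | %s\n",
            $label, $line, defined $got ? "'$got'" : 'no match';
    }
}
```

On these two inputs the original attempt should match only the plural line, since the singular one never contains "genes", while the suggested pattern captures the gene list from both.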
Upvotes: 2
Reputation: 7259
In addition to the other suggestions, I would recommend having a look at perlre, the Perl documentation for regular expressions.
Upvotes: 0