neversaint

Reputation: 64054

Perl regex: extract parts of a string with multiple conditions

What single regex would capture all the text that follows "are genes" or "is gene" in lines like these?

The closest human genes of best are genes A B C
The closest human gene of best is gene A 

I want $1 to contain:

A B C
A 

I tried this, but it fails:

$line =~ /The closest .* gene[s] (.*)$/;

Upvotes: 1

Views: 4409

Answers (5)

Axeman

Reputation: 29854

I think the most explicit is:

$line =~ m/best \s (?:is \s gene|are \s genes) \s (\p{IsUpper}(?: \s \p{IsUpper})*)/x;

Of course if you know that all sentences are going to be grammatical, then you can do the (?:are|is) thing. And if you know that you're only going to have genes A-N or something, you can forget the \p{IsUpper} and use [A-N].
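As a rough sketch of the explicit approach, here is the pattern spread over several lines with /x and run against the two sample lines from the question. The variable name `$re` is only for illustration, and the pattern assumes each gene name is a single uppercase letter.

```perl
use strict;
use warnings;

# Explicit alternation: "is gene" or "are genes", followed by
# single uppercase letters separated by whitespace.
# Assumption: every gene name is exactly one uppercase letter.
my $re = qr/
    best \s (?:is \s gene | are \s genes) \s
    ( \p{IsUpper} (?: \s \p{IsUpper} )* )
/x;

for my $line (
    'The closest human genes of best are genes A B C',
    'The closest human gene of best is gene A ',
) {
    print "$1\n" if $line =~ $re;
}
```

Because the capture group only accepts letters separated by whitespace, the trailing space on the second line is left out of $1.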

Upvotes: 3

Donal Fellows

Reputation: 137727

Use non-greedy at the beginning to reduce the opportunities for surprises. Use non-capturing parens to group alternatives that you don't care about. Append ? to a letter to make it optional. Hence, try this:

$line =~ /The closest .*? (?:is|are) genes? (.*)$/;

To see where you were going wrong BTW, just compare the above with what you were originally trying.
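A minimal sketch of that pattern applied to the two sample lines from the question. The helper name `extract_genes` is made up for illustration, and trimming trailing whitespace is an extra step the question did not ask for.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text after "is gene" / "are genes",
# or undef (in scalar context) when the line does not match.
sub extract_genes {
    my ($line) = @_;
    return unless $line =~ /The closest .*? (?:is|are) genes? (.*)$/;
    my $genes = $1;
    $genes =~ s/\s+$//;    # trim trailing whitespace (an added assumption)
    return $genes;
}

my @lines = (
    'The closest human genes of best are genes A B C',
    'The closest human gene of best is gene A ',
);

for my $line (@lines) {
    my $genes = extract_genes($line);
    print defined $genes ? "$genes\n" : "no match\n";
}
```

The original attempt fails on the second line because `gene[s]` requires a literal "s", whereas `genes?` makes it optional.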

Upvotes: 2

Space

Reputation: 7259

In addition to the other suggestions, have a look at perlre, the Perl regular expressions documentation.

Upvotes: 0

ghostdog74

Reputation: 342759

$ perl -F/genes*/ -ane 'print $F[-1];' file
 A B C
 A

Upvotes: 2

SilentGhost

Reputation: 319881

$line =~ /The closest .* genes? (.*)$/;

Upvotes: 4
