user4951834

Reputation: 731

How to stop regex match if word is found?

I have text like this:

TEXT 786
OPQ RTS

APPENDIX A 

TITLE 

TEXT 123 
ABC EFG 

APPENDIX B

TEXT 456
HIJ KLM

and

TEXT 786
OPQ RTS

APPENDIX A 

TITLE 

TEXT 123 
ABC EFG 

TEXT 456
HIJ KLM

I'm trying to use a regex to extract all the text from APPENDIX A to APPENDIX B if APPENDIX B is present, otherwise from APPENDIX A to the end (i.e., HIJ KLM). Also, APPENDIX A must appear within 15 words before TITLE. This is what I've come up with so far:

(\b(?:appendix)(?:.){0,15}(?:title)(?:.*)(?:appendix){0,1})/is

Problem is, the capture does not stop at APPENDIX B if APPENDIX B is there, it always captures until the end.

Upvotes: 2

Views: 461

Answers (3)

zdim

Reputation: 66899

One way is to use alternation for the optional part:

perl -0777 -wlnE'
    @m = /(appendix .{0,15} title (?: .*?appendix\s\w+ | .*) )/xsig; 
    say for @m
' input.txt

The /g modifier is there so that all sections between appendix markers are matched.

Or capture with multiple groups, one of them for the optional item, then test for it and use the captures accordingly:

perl -0777 -wne'
    @m = /(appendix .{0,15} title) (.*? appendix\s\w+)? (.*)/xsi;
    print join "", ($m[1] ? @m[0,1] : @m[0,2])
' input.txt

This works because a match in list context returns one element per capture group, so the slot for the second ( is present (as undef) even when that group did not match.
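As a quick check of that behavior, here is a minimal sketch. The sample string is made up for illustration and is not from the question; the pattern is the one used above, written without /x.

```perl
use strict;
use warnings;

# Hypothetical input with no second "appendix" marker.
my @m = "APPENDIX A TITLE rest"
    =~ /(appendix.{0,15}title)(.*?appendix\s\w+)?(.*)/si;

# One element per capture group; the optional group is undef here.
printf "captures: %d, second is %s\n",
    scalar @m, defined $m[1] ? "defined" : "undef";
```

So testing $m[1] for truth is enough to tell which of the two optional parts to use.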

With yet more capture groups you can filter in the second case, using grep { defined } @m. If there may be multiple appendix sections, it is better to use while with the $N variables in this approach:

while (/(appendix.{0,15}title)(.*?appendix\s\w+)?(.*)/sig) {
    my $appx_section = ($2) ? $1.$2 : $1.$3;
    ...
}

since one big @m with all captures would need a little analysis.

All these print the desired output in both cases, including multiple appendix-sections.

I've wrapped it in one-liners for ready testing. The code works in a Perl script as it stands.

Upvotes: 1

hoffmeister

Reputation: 612

Just like this, where $var is your string.

if ( $var=~m#(APPENDIX A.{0,15}TITLE.*?(?:APPENDIX B|$))#s )
{
    print $1."\n";
}
else
{
    print "failed\n";
}

Your problem was this: "(?:.*)(?:appendix){0,1})". A greedy .* followed by a group quantified with {0,1} will always take the lot, because regex quantifiers are greedy by default and the optional group is then free to match nothing. .*? is non-greedy, i.e., it takes the minimum possible that still makes the match.
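To see the difference side by side, here is a small sketch. The $text string is typed in from the question's first sample, and the lazy pattern (with \w+ and \z) is just one possible variant, not the only way to write it.

```perl
use strict;
use warnings;

# First sample from the question, as a single string.
my $text = "TEXT 786\nOPQ RTS\n\nAPPENDIX A \n\nTITLE \n\nTEXT 123 \nABC EFG \n\n"
         . "APPENDIX B\n\nTEXT 456\nHIJ KLM\n";

# Greedy: .* runs to the end of the string, then the optional group matches zero times.
my ($greedy) = $text =~ /(appendix.{0,15}title.*(?:appendix){0,1})/is;

# Lazy: .*? grows only until "appendix <word>" or the end of the string can match.
my ($lazy) = $text =~ /(appendix.{0,15}title.*?(?:appendix \w+|\z))/is;

printf "greedy reaches the end: %s\n",   ($greedy =~ /HIJ KLM\s*\z/ ? "yes" : "no");
printf "lazy stops at APPENDIX B: %s\n", ($lazy   =~ /APPENDIX B\z/ ? "yes" : "no");
```

When APPENDIX B is missing, the lazy version falls through to the \z branch and runs to the end, which is the behavior asked for.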

Upvotes: 0

felixmosh

Reputation: 35553

Look at this as an inspiration. Basically, I've split the text by line break, then iterated over each line and converted the lines into blocks.

These blocks are what you want. :]

I'm not familiar with Perl, but the idea should be the same.
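For what it's worth, a rough Perl sketch of that line-by-line idea could look like the following. The sub name appendix_block and the heading pattern are my own assumptions, and the "TITLE within 15 characters" check from the question is left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the lines from the first "APPENDIX <word>"
# heading up to and including the next such heading, or to the end of the
# text if there is no second heading.
sub appendix_block {
    my ($text) = @_;
    my @block;
    for my $line (split /\n/, $text) {
        my $is_heading = $line =~ /^\s*APPENDIX\s+\w+\s*$/i;
        if (!@block) {
            push @block, $line if $is_heading;   # first heading opens the block
            next;
        }
        push @block, $line;
        last if $is_heading;                     # second heading closes the block
    }
    return join "\n", @block;
}

# Second sample from the question (no APPENDIX B).
my $sample = "TEXT 786\nOPQ RTS\n\nAPPENDIX A \n\nTITLE \n\nTEXT 123 \nABC EFG \n\nTEXT 456\nHIJ KLM\n";
print appendix_block($sample), "\n";
```

This trades the single regex for a small loop, which can be easier to extend if more rules about the headings turn up later.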

Upvotes: 0
