Reputation: 731
I have text like this:
TEXT 786
OPQ RTS
APPENDIX A
TITLE
TEXT 123
ABC EFG
APPENDIX B
TEXT 456
HIJ KLM
and
TEXT 786
OPQ RTS
APPENDIX A
TITLE
TEXT 123
ABC EFG
TEXT 456
HIJ KLM
I'm trying to use regex to extract all the text starting from APPENDIX A to APPENDIX B if APPENDIX B is present, otherwise from APPENDIX A until the end (i.e., HIJ KLM). Also, APPENDIX A must appear within 15 words before TITLE. This is what I've come up with so far:
(\b(?:appendix)(?:.){0,15}(?:title)(?:.*)(?:appendix){0,1})/is
The problem is that the capture does not stop at APPENDIX B if APPENDIX B is there; it always captures until the end.
Upvotes: 2
Views: 461
Reputation: 66899
One way is to use alternation for the optional part
perl -0777 -wlnE'
@m = /(appendix .{0,15} title (?: .*?appendix\s\w+ | .*) )/xsig;
say for @m
' input.txt
with /g, so as to match all sections within appendix markers.
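For use in a script rather than a one-liner, the same alternation could be wrapped roughly as below. This is only a sketch: the helper name appendix_sections and the two test strings (built from the samples in the question) are my own, not part of the one-liner above. Note that .{0,15} allows up to 15 characters, not 15 words, just as in the question's own regex.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): returns every appendix
# section found in $text, using the alternation shown above.
sub appendix_sections {
    my ($text) = @_;
    # In list context with /g, the match returns $1 of every match.
    return $text =~ /(appendix .{0,15} title (?: .*? appendix\s\w+ | .* ))/xsig;
}

# The two sample inputs from the question.
my $with_b    = "TEXT 786\nOPQ RTS\nAPPENDIX A\nTITLE\nTEXT 123\nABC EFG\nAPPENDIX B\nTEXT 456\nHIJ KLM\n";
my $without_b = "TEXT 786\nOPQ RTS\nAPPENDIX A\nTITLE\nTEXT 123\nABC EFG\nTEXT 456\nHIJ KLM\n";

for my $input ($with_b, $without_b) {
    print "[$_]\n" for appendix_sections($input);
}
```

The first branch of the alternation is tried first, so the match stops at the next appendix marker when there is one and only falls back to .* when there isn't.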
Or capture with multiple groups, one for the optional item, then test for it and use accordingly
perl -0777 -wne'
@m = /(appendix .{0,15} title) (.*? appendix\s\w+)? (.*)/xsi;
print join "", ($m[1] ? @m[0,1] : @m[0,2])
' input.txt
This works because a slot for $2 is created for the second opening parenthesis even if that group does not match; it is then undef, so the positions in @m stay fixed.
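A quick way to see this for yourself, as a small sketch with a made-up input that has no APPENDIX B:

```perl
use strict;
use warnings;

# Illustrative input: no APPENDIX B, so the optional group cannot match.
my $text = "APPENDIX A\nTITLE\nTEXT 123\nHIJ KLM";

my @m = $text =~ /(appendix .{0,15} title) (.*? appendix\s\w+)? (.*)/xsi;

# Three groups give three elements; the one that did not take part is undef.
printf "%d captures, second is %s\n",
    scalar @m, defined $m[1] ? "defined" : "undef";
# prints "3 captures, second is undef"
```

Since the list always has one element per group, testing $m[1] is enough to decide which pieces to join.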
With yet more capture groups you can filter in the second case, with grep { defined } @m. If there may be multiple appendix-sections, better use while with the $N variables in this approach
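# Alternation instead of a trailing (.*) after the optional group: otherwise
# every match runs to the end of the input and /g never finds a second section
while (/(appendix.{0,15}title)(?:(.*?appendix\s\w+)|(.*))/sig) {
    my $appx_section = defined $2 ? $1.$2 : $1.$3;
    ...
}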
since one big @m with all captures would need a little analysis.
All these print the desired output in both cases, including multiple appendix-sections.
I've wrapped it in one-liners for ready testing. The code works in a Perl script as it stands.
Upvotes: 1
Reputation: 612
Just like this, where $var is your string.
if ( $var =~ m#(APPENDIX A.{0,15}TITLE.*?(?:APPENDIX B|$))#s ) {
    print "$1\n";
}
else {
    print "failed\n";
}
Your problem was this part: "(?:.*)(?:appendix){0,1})". A greedy .* followed by a group that is optional ({0,1}) means the match will always take the lot, because regex quantifiers are greedy by default. .*? is non-greedy, i.e., it takes the minimum possible that still makes the match.
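To see the difference side by side, here is a small sketch; the input string is a shortened version of the question's sample, and the variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = "APPENDIX A\nTITLE\nABC EFG\nAPPENDIX B\nHIJ KLM";

# Greedy .* with an optional marker: .* swallows everything to the end,
# and the optional group is satisfied by matching nothing.
my ($greedy) = $text =~ /(APPENDIX A.{0,15}TITLE.*(?:APPENDIX B)?)/s;

# Non-greedy .*? with "marker or end of string": stops at the first marker.
my ($lazy) = $text =~ /(APPENDIX A.{0,15}TITLE.*?(?:APPENDIX B|$))/s;

print "greedy matched the whole input\n" if $greedy eq $text;
print "lazy stopped at: ", (split /\n/, $lazy)[-1], "\n";
```

The |$ alternative is what keeps the non-greedy version working when there is no APPENDIX B at all: it then expands until the end of the string.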
Upvotes: 0