Extracting the Nth subgroup within a regex match

Question

Thanks so much for taking the time to read this. I'm still pretty new to Perl, so any help is appreciated!

I am trying to use a regular expression to extract a piece of text from a large set of large documents.

I have a regular expression that I use to identify where in the larger document I want to start extracting. The pattern often matches at several places in a document, but I am able to identify which of these matches is the start of the body of text I want to extract (in the example below this would be $finds[2]).

What I would like to do is to run the same regular expression again with a .*?$END added to it to extract the text, where $END identifies the end. But what I need is a way to tell the regular expression to start extracting at the Nth occurrence of the $START pattern.

Consider this:

my $sentence = 'A1Z blah blah A2Z blah blah A3Z blah A4Z END A5Z';
my @finds = $sentence =~ m/(A\dZ)/mg;

####################
##  Code that determines the element of @finds that
## contains the match to the extraction I want.
## For this question assume it is the third match (A3Z),
## element index number 2.
####################

my $START = 2;

Here are my attempts:

my @finds2 = ($sentence =~ m/((A\dZ){$START}.*?(END))/mg);

my @finds2 = ($sentence =~ m/((A\dZ)[$START].*?(END))/mg);

I would like it if the {$START} or the [$START] told Perl to wait until it had reached the "$START"th match before starting to extract, and then to continue matching.

I know that my attempts aren't correct. Hopefully they help indicate what I am trying to do.

amon · Accepted Answer

Does this do something you like?

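my $pos = 3;    # 1-based count: the third A\dZ is $finds[2] in the question
my $END = "END";
my $a = "A1Z blah blah A2Z blah blah A3Z blah A4Z END A5Z";
# /x ignores the whitespace in the pattern; {$pos} repeats the whole group
$a =~ / (?:.*?A\dZ){$pos} (.*?) $END /x;
print $1, "\n" if defined $1;
# prints " blah A4Z "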

This code looks for the n-th occurrence of the A\dZ pattern (the number is specified in $pos) and then captures everything after it into $1 until the pattern in $END is encountered. Note that $pos counts from 1, so the third occurrence corresponds to $finds[2] in the question. If you really need performance, I would suggest looking into the \G assertion, which matches where your previous match left off. This can be combined with the built-in subroutine pos. Preventing "backtracking" can also improve performance, but this is an advanced topic I don't know too much about.
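For what it's worth, here is a rough sketch of how the \G and pos idea might look. It is only one way to do it: the variable names ($sentence, $n, $end, $extracted) are made up for illustration, and it assumes the end marker is a literal string rather than a pattern.

```perl
use strict;
use warnings;

my $sentence = 'A1Z blah blah A2Z blah blah A3Z blah A4Z END A5Z';
my $n   = 3;        # 1-based: start extracting after the 3rd A\dZ
my $end = 'END';    # assumed to be a literal end marker

my $extracted;
my $count = 0;

# In scalar context, each successful /g match advances pos($sentence)
# to just past the marker that was found.
while ($sentence =~ /A\dZ/g) {
    next unless ++$count == $n;

    # \G anchors this match at pos($sentence), i.e. right after the
    # Nth marker. /c keeps pos unchanged if the match fails, and /s
    # lets . cross newlines in multi-line documents.
    if ($sentence =~ /\G(.*?)\Q$end\E/gcs) {
        $extracted = $1;
    }
    last;
}

print defined $extracted ? "[$extracted]\n" : "no match\n";
# prints "[ blah A4Z ]"
```

The loop only walks forward through the string once, so nothing before the Nth marker is rescanned when looking for the end.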

Suggested Readings: "perlop - Regexp Quote-Like Operators", "perlre - Assertions" and "perldoc -f pos".

(Another possibility might be splitting your input into smaller strings, but in many cases the simplest Perl solution is also the best.)
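As a sketch of the splitting idea, and only under the assumption that the start markers all look like A\dZ, something along these lines could work. The names @chunks, $tail and $body are hypothetical, and $START is the 0-based index from the question.

```perl
use strict;
use warnings;

my $sentence = 'A1Z blah blah A2Z blah blah A3Z blah A4Z END A5Z';
my $START    = 2;       # 0-based index, like $finds[2] in the question
my $END      = 'END';   # assumed to be a literal end marker

# The zero-width lookahead splits *before* each marker, so every chunk
# keeps its own marker at the front.
my @chunks = split /(?=A\dZ)/, $sentence;

# Drop any leading text that comes before the first marker, so that
# chunk indices line up with the indices of @finds.
shift @chunks if @chunks && $chunks[0] !~ /^A\dZ/;

my $body;
if ($START <= $#chunks) {
    # Rejoin everything from the chosen marker onward and match there.
    my $tail = join '', @chunks[$START .. $#chunks];
    ($body) = $tail =~ /^A\dZ(.*?)\Q$END\E/s;
}

print defined $body ? "[$body]\n" : "no match\n";
# prints "[ blah A4Z ]"
```

This copies the tail of the string, so for very large documents the single-regex or \G approach is probably the better fit.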
