Reputation: 352
Thanks so much for taking the time to read this. I'm still pretty new to Perl, so any help is appreciated!
I am trying to use a regular expression to extract a piece of text from a large set of large documents.
I have a regular expression that I use to identify where in the larger document I want to start extracting. This regular expression often matches in multiple places, but I am able to identify which of these matches is the start of the body of text I want to extract (in the example below this would be $finds[2]).
What I would like to do is run the same regular expression again with a .*?$END added to it, where $END identifies the end of the text to extract. What I need is a way to tell the regular expression to start extracting at the Nth occurrence of the $START pattern.
Consider this:
my $sentence = 'A1Z blah blah A2Z blah blah A3Z blah A4Z END A5Z';
my @finds = $sentence =~ m/(A\dZ)/mg;
####################
## Code that determines the element of @finds that
## contains the match to the extraction I want.
## For this question assume it is the third match (A3Z),
## Element index number 2.
####################
my $START = 2;
Here are my attempts:
my @finds2 = ($sentence =~ m/((A\dZ){$START}.*?(END))/mg);
my @finds2 = ($sentence =~ m/((A\dZ)[$START].*?(END))/mg);
I would like the {$START} or the [$START] to tell Perl to wait until it reaches the "$START"th match before it starts extracting, and then to continue matching from there.
I know that my attempts aren't correct. Hopefully they help indicate what I am trying to do.
Upvotes: 1
Views: 1319
Reputation: 57600
Does this do what you want?
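my $pos  = 3;      # which occurrence of the marker to start after (1-based)
my $END  = "END";
my $text = "A1Z blah blah A2Z blah blah A3Z blah A4Z END A5Z";
# Skip $pos markers, then capture lazily up to $END (whitespace is ignored under /x).
if ($text =~ / (?:.*?A\dZ){$pos} (.*?) $END /x) {
    print $1, "\n";
}
# prints " blah A4Z "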
This code looks for the n-th occurrence of the A\dZ pattern (n is given in $pos) and then captures into $1 everything up to the first match of the pattern in $END. If you really need performance, I would suggest looking into the \G assertion, which matches where your previous match left off. This can be combined with the built-in function pos. Preventing backtracking can also improve performance, but this is an advanced topic I don't know too much about.
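As a rough illustration of the \G and pos idea, here is a minimal sketch that reuses the question's $sentence and zero-based $START. The @starts array and $extracted variable are names made up for this example, and the /s flag is an assumption on my part so that . can cross newlines in multi-line documents.

```perl
use strict;
use warnings;

my $sentence = 'A1Z blah blah A2Z blah blah A3Z blah A4Z END A5Z';
my $START    = 2;        # zero-based index of the wanted marker, as in the question
my $END      = 'END';

# Record the start offset of every marker match ($-[0] is the offset of the last match).
my @starts;
while ($sentence =~ /A\dZ/g) {
    push @starts, $-[0];
}

my $extracted;
if ($START < @starts) {
    # Tell the regex engine where the next /g match should begin...
    pos($sentence) = $starts[$START];
    # ...and anchor the match there with \G so no earlier marker is considered.
    if ($sentence =~ /\G(A\dZ.*?\Q$END\E)/gs) {
        $extracted = $1;
    }
}

print "$extracted\n" if defined $extracted;
# prints "A3Z blah A4Z END"
```

Unlike the quantifier version, this one includes the marker itself in the result, and \Q...\E treats $END as literal text rather than as a pattern.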
Suggested Readings: "perlop - Regexp Quote-Like Operators", "perlre - Assertions" and "perldoc -f pos".
(Another possibility might be splitting your input into smaller strings, but in many cases the simplest Perl solution is also the best.)
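One way the splitting idea might look, as a sketch only: cut the string in front of every marker with a lookahead, then search from the chosen chunk onward. The names @chunks and $tail are invented for this example, and it assumes the input begins with a marker. If other text comes first, that text becomes chunk 0 and the indexes shift by one.

```perl
use strict;
use warnings;

my $sentence = 'A1Z blah blah A2Z blah blah A3Z blah A4Z END A5Z';
my $START    = 2;        # zero-based index of the wanted marker
my $END      = 'END';

# Split just before each marker; the lookahead keeps the marker in its chunk.
my @chunks = split /(?=A\dZ)/, $sentence;

my $extracted;
if ($START <= $#chunks) {
    # Rejoin everything from the chosen chunk onward and stop at the first $END.
    my $tail = join '', @chunks[$START .. $#chunks];
    ($extracted) = $tail =~ /^(.*?\Q$END\E)/s;
}

print "$extracted\n" if defined $extracted;
# prints "A3Z blah A4Z END"
```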
Upvotes: 3