Reputation: 7610

How can I detect how many capture groups are in a Perl Regexp?

I have a bunch of perl regexps in a script. I would like to know how many capture groups are in them. More precisely I'd like to know how many items would be added to the @- and @+ arrays if they matched before actually use them in a real match op.

An example:

'XXAB(CD)DE\FG\XX' =~ /(?i)x(ab)\(cd\)(?:de)\\(fg\\)x/
    and print "'@-', '@+'\n";

In this case the output is:

'1 2 11', '15 4 14'

So after matching I know that the 0th item is the matched part of the string, and there are two capture group expressions. Would it be possible to know right before the actual match?

I tried to concentrate onto the opening brackets. So I removed the '\\' patterns first to make easier to detect the escaped brackets. Then I removed '\(' strings. Then came '(?'. Now I can count the remaining opening brackets.

my $re = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x'; print "ORIG: '$re'\n";
'XXAB(CD)DE\FG\XX' =~ /$re/ and print "RE: '@-', '@+'\n";
$re =~ s/\\\\//g; print "\\\\: '$re'\n";
$re =~ s/\\\(//g; print "\\(: '$re'\n";
$re =~ s/\(\?//g; print "\\?: '$re'\n";
my $n = ($re =~ s/\(//g); print "n=$n\n";

Output:

ORIG: '(?i)x(ab)\(cd\)(?:de)\\(fg\\)x'
RE: '1 2 11', '15 4 14'
\\: '(?i)x(ab)\(cd\)(?:de)(fg)x'
\(: '(?i)x(ab)cd\)(?:de)(fg)x'
\?: 'i)x(ab)cd\):de)(fg)x'
n=2

So here I know that 2 capture groups are in this regexp. But maybe there is an easier way and this is definitely not complete (e.g. this treats (?<foo>...) and (?'foo'...) as a non-caputre groups).

Another way would be to dump the internal data structures of regcomp function. Maybe the package Regexp::Debugger could solve the issue, but I have no right to install packages in my environment.

Actually the regexps are keys to some ARRAY refs and I'd like to check if the referenced ARRAY contains the proper amount of values before actually applying the regexps. Of course this checking can be done right after the pattern matching, but it would be nicer if I could do it in the loading stage of the script.

Thank you for your help and comments in advance!

Upvotes: 7

Answers (3)

TrueY

Reputation: 7610

As Mr. Obama said: "Yes We Can!"

I found a solution which does not require additional module and handles all the possible capturing group events (as I know). As ikegami mentioned it needs reparsing of the regexp, but perl does it for us.

During digging in the haystack of Perl modules on CPAN I found a very interesting one called warnings::regex::recompile. It generates a warning message each time a regexp is re-compiled. Analyzing the source I found the solution of my problem.

Using use re qw/Debug DUMP/; Perl returns the parsed regex to STDERR. In the original module result is dumped to a real file and then reread to process. I modified the code to use in-memory file.

My solution is:

sub dumpre {
  use re qw(eval Debug DUMP);
  my $buf = ''; 

  open OLDERR, '>&', STDERR or die "$!";
  close STDERR or die "$!";
  open STDERR, '>', \$buf or die "$!";

  my $re = qr/$_[0]/;

  close STDERR or die "$!";
  open STDERR, '>&', OLDERR or die "$!";
  close OLDERR or die "$!";

  no re 'debug'; # Needed because of split

  return [ split '\n', $buf ];
}

This function turns on DUMP when compiling a regexp. Enables eval to be able to handle (?{...}) and (??{...}) expressions.

my $re = 'aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+';
my $r = dumpre $re;
print join "\n", @$r;

The result is:

Compiling REx "aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+"
Final program:
   1: EXACT <aa> (3)
   3: CURLYX[0] {0,1} (19)
   5:   CURLYM[1] {1,32767} (16)
   9:     EXACT <a> (11)
  11:     POSIXU[\d] (14)
  14:     SUCCEED (0)
  15:   NOTHING (16)
  16:   EXACT <x> (18)
  18: WHILEM (0)
  19: NOTHING (20)
  20: CURLYX[1] {0,32767} (40)
  22:   OPEN2 (24)
  24:     CURLYM[3] {0,32767} (35)
  28:       EXACT <b> (30)
  30:       POSIXU[\d] (33)
  33:       SUCCEED (0)
  34:     NOTHING (35)
  35:     EXACT <d> (37)
  37:   CLOSE2 (39)
  39: WHILEM[1/7] (0)
  40: NOTHING (41)
  41: STAR (44)
  42:   EXACT <c> (0)
  44: CURLYM[4] {0,1} (55)
  48:   EXACT <d> (50)
  50:   POSIXU[\d] (53)
  53:   SUCCEED (0)
  54: NOTHING (55)
  55: OPEN5 (57)
  57:   STAR (60)
  58:     EXACT <e> (0)
  60: CLOSE5 (62)
  62: OPEN6 (64)
  64:   CURLYN[7] {1,32767} (74)
  66:     NOTHING (68)
  68:     EXACT <f> (0)
  72:   WHILEM (0)
  73:   NOTHING (74)
  74: CLOSE6 (76)
  76: CURLYN[8] {1,32767} (86)
  78:   NOTHING (80)
  80:   EXACT <g> (0)
  84: WHILEM (0)
  85: NOTHING (86)
  86: END (0)
anchored "aa" at 0 floating "fg" at 2..9223372036854775807 (checking floating) minlen 4

So the lines with OPEN\d+, CURLYM[\d+], CURLYN[\d+] shows capturing bracket expressions (Line syntax: segment_no: regex command (next segment)). (Note: CURLYX is a non-capturing bracket expression like (?:...)+). The number after OPEN/CURLY[MN} shows the ordinal number of capturing group. The last has to be found. Which is 8 in this case.

Unfortunately it does not handle if the (??{...}) returns a bracket expression, but this is not really needed for me now. I suppose the format is not fixed, so it can differ from version to version. But it is ok for me.

Upvotes: 0

SREagle

Reputation: 251

Without any limiting requirements for the occuring regexes, there is no definitive answer to the number of capture groups, I think. Just think of alternatives with a differing capture group count and the possibility of this occuring again in each branch:

my $re = qr/ A(B)C | A(D|(E(G+|H))F /x;

This regex can obviously have up to 3 capture groups. You could recursively parse each branch, and take the highest number as your result - but I honestly cannot come up with a practical way to do this in a short time. For 'linear' regexes not using alternatives or non-basic regex features, the task of determining the count of capture groups is possible, but I don't think it's feasible with more advanced ones.

Upvotes: 1

revo

Reputation: 48751

Regex:

\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))

Explanation:

\\.                     # Match any escaped character
(*SKIP)(?!)             # Discard it
|                       # OR
\(                      # Match a single `(`
(?(?=\?)                # Which if is followed by `?`
    \?                      # Match `?`
    P?['<]\w+['>]           # Next characters should be matched as ?P'name', ?<name> or ?'name'
)                       # End of conditional statement

Perl:

my @offsets = ();
while ('XXAB(CD)DE\FG\X(X)' =~ /\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))/g){
    push @offsets, "$-[0]";
}
print join(", ", @offsets);

Output:

4, 15

Which represents existence of two capturing groups in input string.

Upvotes: 1

How can I detect how many capture groups are in a Perl Regexp?

Answers (3)

Related Questions