ceving

Reputation: 23824

How to access a matching sequence in Perl?

The following expression

print Dumper "a.b.c.d" =~ /(.)(?:\.(.))*/

prints

$VAR1 = 'a';
$VAR2 = 'd';

Where are b and c? How to get them?

Update: Further simplification

The following expression

print Dumper "abcd" =~ /(.)+/

captures only the last character:

$VAR1 = 'd';

How to capture all characters?

Another update for those who do not believe me that I can neither use the global modifier nor split.

I would like to capture all digits in the following example.

print Dumper "x1234y" =~ /(\d)+/

The expression captures just the last digit:

$VAR1 = '4';

What I want is this:

@digits = ();
"x1234y" =~ /^x(?:(\d)(?{push @digits, $^N}))+y$/;
print Dumper @digits;

Which captures all digits:

$VAR1 = '1';
$VAR2 = '2';
$VAR3 = '3';
$VAR4 = '4';

The Perl documentation marks the code embedding feature with the following warning:

Using this feature safely requires that you understand its limitations.

Because I am not sure whether I understand the limitations, I thought there might be a simpler way to get the same result. That is what this question is about.

Upvotes: 2

Views: 123

Answers (2)

user2404501


I don't think there's a way to get a single capture group inside a repetition operator to produce multiple output strings. And you still haven't shown an example that isn't handled perfectly well by a simple capture of the whole sequence and then split afterward.
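For reference, a minimal sketch of the capture-then-split approach mentioned above, using the x1234y string from the question. The variable names are only for illustration, and it assumes the repeated part can be described by one capture of the whole run:

```perl
use strict;
use warnings;

my $input = "x1234y";
my @digits;

# Validate the whole structure first, capturing the digit run as one string,
# then break that string into single characters.
if ($input =~ /^x(\d+)y$/) {
    @digits = split //, $1;
}

print "@digits\n";   # prints "1 2 3 4"
```

This keeps the anchored validation of the original pattern and needs neither the global modifier on the match itself nor embedded code.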

The (?{...}) solution isn't bad, especially in your example, where the pattern is anchored at both ends, so a successful match involves no backtracking and the code block runs exactly once per digit.

To work in a more general case, you should initialize the array just before the repetition starts, so you don't get leftovers from partial matches, like this:

@digits=();
"x8w x9z x1234y" =~ /x(?:(\d)(?{push @digits, $^N}))+y/;
# bad, @digits=(8,9,1,2,3,4)

"x88w x9z x1234y" =~ /x(?{@digits=()})(?:(\d)(?{push @digits, $^N}))+y/;
# better, @digits=(1,2,3,4)
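The second variant can be tried as a standalone script along these lines. This is only a sketch: declaring @digits with our (rather than my) is my own choice here, made so that the embedded code block refers to a package variable regardless of Perl version.

```perl
use strict;
use warnings;

# Package variable (an assumption of this sketch, not part of the answer above).
our @digits;

# The first code block empties the array each time the engine starts a new
# attempt at an "x", so digits pushed during failed attempts are discarded.
"x8w x9z x1234y" =~ /x(?{ @digits = () })(?:(\d)(?{ push @digits, $^N }))+y/;

print "@digits\n";   # prints "1 2 3 4"
```

The reset only covers attempts that restart at a new x. If the pattern could backtrack inside the repetition itself, pushes made on the abandoned path would still be in the array.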

Upvotes: 1

Sobrique

Reputation: 53478

They're not there because the * is outside your capture group: the quantifier repeats the group, but there is still only one $2, and each iteration overwrites what the previous iteration captured.

Thus what happens in the regex engine is:

Compiling REx "(.)(?:\.(.))*"
Final program:
   1: OPEN1 (3)
   3:   REG_ANY (4)
   4: CLOSE1 (6)
   6: CURLYX[1] {0,32767} (16)
   8:   EXACT <.> (10)
  10:   OPEN2 (12)
  12:     REG_ANY (13)
  13:   CLOSE2 (15)
  15: WHILEM[1/1] (0)
  16: NOTHING (17)
  17: END (0)
minlen 1 
Matching REx "(.)(?:\.(.))*" against "a.b.c.d"
   0 <> <a.b.c.d>            |  1:OPEN1(3)
   0 <> <a.b.c.d>            |  3:REG_ANY(4)
   1 <a> <.b.c.d>            |  4:CLOSE1(6)
   1 <a> <.b.c.d>            |  6:CURLYX[1] {0,32767}(16)
   1 <a> <.b.c.d>            | 15:  WHILEM[1/1](0)
                                    whilem: matched 0 out of 0..32767
   1 <a> <.b.c.d>            |  8:    EXACT <.>(10)
   2 <a.> <b.c.d>            | 10:    OPEN2(12)
   2 <a.> <b.c.d>            | 12:    REG_ANY(13)
   3 <a.b> <.c.d>            | 13:    CLOSE2(15)
   3 <a.b> <.c.d>            | 15:    WHILEM[1/1](0)
                                      whilem: matched 1 out of 0..32767
   3 <a.b> <.c.d>            |  8:      EXACT <.>(10)
   4 <a.b.> <c.d>            | 10:      OPEN2(12)
   4 <a.b.> <c.d>            | 12:      REG_ANY(13)
   5 <a.b.c> <.d>            | 13:      CLOSE2(15)
   5 <a.b.c> <.d>            | 15:      WHILEM[1/1](0)
                                        whilem: matched 2 out of 0..32767
   5 <a.b.c> <.d>            |  8:        EXACT <.>(10)
   6 <a.b.c.> <d>            | 10:        OPEN2(12)
   6 <a.b.c.> <d>            | 12:        REG_ANY(13)
   7 <a.b.c.d> <>            | 13:        CLOSE2(15)
   7 <a.b.c.d> <>            | 15:        WHILEM[1/1](0)
                                          whilem: matched 3 out of 0..32767
   7 <a.b.c.d> <>            |  8:          EXACT <.>(10)
                                            failed...
                                          whilem: failed, trying continuation...
   7 <a.b.c.d> <>            | 16:          NOTHING(17)
   7 <a.b.c.d> <>            | 17:          END(0)
Match successful!
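In the trace, OPEN2/CLOSE2 run three times, but each pass writes to the same group. A small sketch to confirm the effect from Perl itself (the variable name @caps is just for illustration; a trace like the one above is the kind of output use re 'debug' produces):

```perl
use strict;
use warnings;

# In list context the match returns one value per capture group,
# no matter how many times a group was entered during matching.
my @caps = "a.b.c.d" =~ /(.)(?:\.(.))*/;

print scalar(@caps), " captures: @caps\n";   # prints "2 captures: a d"
```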

Once you match repeatedly with /g, as below, the non-capturing group (?:...) is no longer needed either.

How about:

print Dumper "a.b.c.d" =~ /(.)\.(.)/g;

Which prints:

$VAR1 = 'a';
$VAR2 = 'b';
$VAR3 = 'c';
$VAR4 = 'd';

Or alternatively (it wasn't entirely clear what you were seeking):

print Dumper "a.b.c.d" =~ /(.)\.(.+)/;

Which prints:

$VAR1 = 'a';
$VAR2 = 'b.c.d';

Update: Further simplification

print Dumper "abcd" =~ /(.)+/

How to capture all characters?

Again, the problem is that the brackets capture a single character, and the + outside them only repeats that group. So you will only ever get a single character here (the last one matched), because that is what the pattern asks for.

If you want all of them, put the plus inside the brackets and you will get a single scalar:

print Dumper "abcd" =~ /(.+)/;

If you want each of them as separate elements:

print Dumper "abcd" =~ /(.)/g;

The /g modifier repeats the capture at each position in the string. Alternatively, use split:

print Dumper split //, "abcd";

Edit: Following your last edit:

"x1234y" =~ /^x(?:(\d)(?{push @digits, $^N}))+y$/;

This still doesn't properly illustrate the question you're asking, because this works:

my @digits = "x1234y" =~ m/(\d)/g; 
print Dumper \@digits;

And if it didn't, then this works:

my @digits = split ( //, "x1234y" =~ s/\D//rg  );

Or this:

my @digits = grep { /\d/ } split //, "x1234y";

Upvotes: 1
