AnonGeek

Reputation: 7938

Unable to get correct match start and end offsets

I have a regex as below:

$regex = qr/(?sx-im:(?sx-im:(?:^|(?<=\n)))(?=(?sx-im:[\ \t]*)(?sx-im:(?:^|(?<=\n))Data\ and\ value)(?sx-im:[\ \t\r]*(?:$|\n))))/;

I am matching it against following text:

$text ="Data and value";

Now I want to get the match-start offset, match-end offset and matched text.

Normally I use @-, @+ and $& to get these like below:

if ($text =~ m/$regex/)
{
        print "START Offset = ".$-[0]."\n";
        print "END Offset   = ".$+[0]."\n";
        print "Matched Text = ".$&."\n";
}

In this case the match succeeds, but I do not get the offsets and matched text I expect. It prints 0 for both the match-start and match-end offsets, and an empty string for the matched text.

I want to understand the different components of this regex: specifically, what (?sx-im: means, and how to get the matched text.

Please don't ask why the regex looks like this or suggest changing it. It is software-generated, and I have simplified my problem for the sake of the question.

Please guide me on where to start in understanding this regex and getting the match offsets.

Upvotes: 0

Views: 144

Answers (2)

pndc

Reputation: 3795

The bug is in your regex, not your understanding of match offsets. It is matching a zero-width string at the start of the string, and correctly reporting start and end offsets of 0.

Why it matches there is another good question. You can split the regex thus (untested):

qr/(?sx-im:
  (?sx-im:(?:^|(?<=\n)))
  (?=(?sx-im:[\ \t]*)(?sx-im:(?:^|(?<=\n))Data\ and\ value)(?sx-im:[\ \t\r]*(?:$|\n)))
)/x

And you can see the two sequential halves of it:

  • The first matches at the start of the string (^) or just after a \n (lookbehind), i.e. at the start of a line. Both alternatives are zero-width.
  • The second is a lookahead over everything else, and a lookahead is again a zero-width match.
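The zero-width result can be checked directly. The following is a minimal sketch that runs the regex and text exactly as posted in the question; the variable names $start, $end and $matched are only illustrative.

```perl
use strict;
use warnings;

# The regex and text exactly as posted in the question.
my $regex = qr/(?sx-im:(?sx-im:(?:^|(?<=\n)))(?=(?sx-im:[\ \t]*)(?sx-im:(?:^|(?<=\n))Data\ and\ value)(?sx-im:[\ \t\r]*(?:$|\n))))/;
my $text  = "Data and value";

my ($start, $end, $matched);
if ($text =~ $regex) {
    # @- and @+ hold the start and end offsets of the whole match.
    ($start, $end, $matched) = ($-[0], $+[0], $&);
    printf "start=%d end=%d length=%d\n", $start, $end, length($matched);
    # prints "start=0 end=0 length=0"
}
```

Both halves of the pattern are assertions, so nothing is consumed and the overall match is the empty string at offset 0.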

You appear to be trying to do too much with a regex, in particular matching the start and end of lines. Consider reading your source file line-by-line and processing individual lines rather than trying to do it all with a regex.
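A rough sketch of the line-by-line approach, not the asker's actual setup: the sample data in $source, the simplified per-line pattern and the @hits structure are all assumptions made for illustration. An in-memory filehandle stands in for the real source file, and a running offset turns per-line positions into positions in the whole input.

```perl
use strict;
use warnings;

# Hypothetical input; a real program would open the actual file instead.
my $source = "header\nData and value\t\nfooter\n";
open my $fh, '<', \$source or die "cannot open in-memory file: $!";

my @hits;
my $offset = 0;    # offset of the current line within the whole input
while (my $line = <$fh>) {
    # Simplified per-line pattern (an assumption, not the generated regex):
    # the phrase at the start of the line, optional trailing blanks after it.
    if ($line =~ /^(Data and value)[ \t\r]*$/) {
        push @hits, {
            start => $offset + $-[1],
            end   => $offset + $+[1],
            text  => $1,
        };
    }
    $offset += length $line;
}
close $fh;

printf "start=%d end=%d text=%s\n", @{$_}{qw(start end text)} for @hits;
# prints "start=7 end=21 text=Data and value"
```

Because each line is matched on its own, the start-of-line and end-of-line handling reduces to plain ^ and $.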

Upvotes: 4

choroba

Reputation: 241828

(?: ... ) is a non-capturing group. It does not create a backreference.

Similarly, (?= ... ) is a zero-width look-ahead assertion. It does not include the text it matches in $&.
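To illustrate, here is a small sketch using a simplified pattern rather than the generated regex. It also shows one possible way to recover the text: a capturing group placed inside the look-ahead still sets $1, $-[1] and $+[1]. This does modify the pattern, so it may not fit a case where the regex must stay untouched.

```perl
use strict;
use warnings;

my $text = "Data and value";

# Look-ahead only: the match succeeds but consumes nothing.
my ($plain_start, $plain_end, $plain_text);
if ($text =~ /^(?=Data and value)/) {
    ($plain_start, $plain_end, $plain_text) = ($-[0], $+[0], $&);
}

# Same look-ahead with a capturing group inside it: the overall match is
# still empty, but group 1 records the text and its offsets.
my ($cap_start, $cap_end, $cap_text);
if ($text =~ /^(?=(Data and value))/) {
    ($cap_start, $cap_end, $cap_text) = ($-[1], $+[1], $1);
}

printf "whole match: %d-%d '%s'\n", $plain_start, $plain_end, $plain_text;
# prints "whole match: 0-0 ''"
printf "group 1:     %d-%d '%s'\n", $cap_start, $cap_end, $cap_text;
# prints "group 1:     0-14 'Data and value'"
```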

See the Extended Patterns section of perlre.
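The (?sx-im: prefix asked about in the question is the same non-capturing group with inline modifiers attached: flags before the hyphen (s, x) are switched on inside the group, and flags after it (i, m) are switched off. A minimal sketch with made-up test strings:

```perl
use strict;
use warnings;

# (?i:...) turns case-insensitivity on inside the group only.
my $inside  = "DATA and value" =~ /(?i:data) and value/ ? 1 : 0;    # 1
my $outside = "DATA AND value" =~ /(?i:data) and value/ ? 1 : 0;    # 0

# (?-i:...) turns it off inside the group even under an outer /i.
my $off = "DATA and value" =~ /(?-i:data) and value/i ? 1 : 0;      # 0

# (?x:...) ignores unescaped whitespace inside the group, which is why
# the generated regex writes its literal spaces as "\ ".
my $x = "Data and value" =~ /(?x: Data \  and \  value )/ ? 1 : 0;  # 1

print "inside=$inside outside=$outside off=$off x=$x\n";
# prints "inside=1 outside=0 off=0 x=1"
```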

Upvotes: 4
