Wes

Reputation: 525

Finding repeating tagged substrings

I have a file where the lines are made up of fields that are:

An example line:

%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

Only certain sets of tags matter when searching. This is the tag set for my example:

%t, %u, %v, %w, %x, %xx, %y, %z

I want to find the content of fields where the tag is in the set and the field content is repeated in a subsequent field tagged from the set. Here is the code of my unsuccessful attempt:

my $tagmrkr='%';
my $line='%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff';

my $searchtags = qr/t|u|v|w|x|xx|y|z/; # excludes q

print qq/The line:$line\n\n/;
for ($line =~ m/
    $tagmrkr$searchtags\ ([^\,]*,)
    .*?
    $tagmrkr$searchtags\ \1
    /gx) {
        print qq/First field contents:$1\n/;
        print qq/Entire match:$&\n/;
        print qq/\n/;
        }

I was expecting:

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:this,
Entire match:%t this,%u that,%v this,

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

I got:

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

Question 1:
Why are the $1 and $& for the first match being replaced by the values from the second match?

Question 2:
What should I change to get what I want (below) rather than what I expected?

What I want is to be able to re-pivot the match so that it also finds the repeated field in spite of overlaps, that is, where the first field of the second match occurs before the second field of the first match. Actually, for my immediate purposes, all I need is the duplicated field content.

I.e., I want 3 matches from the example:

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:this
Entire match:%t this,%u that,%v this,

First field contents:that
Entire match:%u that,%v this,%t that,

First field contents:the other
Entire match:%x the other,%xx only once,%q the other,%z the other,

Upvotes: 3

Views: 110

Answers (2)

zdim

Reputation: 66881

One way to allow for overlaps is to assert the presence of the rest of the phrase with a lookahead. That part is then not consumed, so the engine resumes searching from just before it and can match it again.

use warnings;
use strict;
use feature 'say';

my $s = q(%a astuff,%b bstuff,%t this,%u that,%v this,%t that,)
      . q(%x the other,%xx only once,%q the other,%z the other,%c cstuff); 

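my $m = qr/%/;                        # tag marker
my $t = qr/(?:t|u|v|w|x|xx|y|z)/;     # tags in the search set

# Consume only the first field; the lookahead captures the text up to and
# including the repeated field without consuming it.
while ($s =~ / $m$t \s ([^,]+) , (?=(.*?$m$t\s\g{1},?)) /gx) {
    say "capture: $1";
    say "  whole: $1,$2";
}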

For a more detailed explanation of how the lookahead helps in catching overlapping patterns, see this post.

Prints

capture: this
  whole: this,%u that,%v this,
capture: that
  whole: that,%v this,%t that,
capture: the other
  whole: the other,%xx only once,%q the other,%z the other,
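
Since the question says only the duplicated field content is needed, the same lookahead idea can be wrapped in a small function. This is a minimal sketch, not part of the answer above: the name `repeated_contents` is hypothetical, and the `(?:,|\z)` boundary is an added assumption so that a repeat must be a whole field (it also allows the repeat to be the last field on the line).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the content of every field from the tag set
# whose content is repeated in a later field that is also from the tag set.
sub repeated_contents {
    my ($line) = @_;
    my $tag = qr/%(?:t|u|v|w|x|xx|y|z)/;   # tag set from the question
    my @found;
    # The lookahead checks for the repeat without consuming it, so the
    # next search resumes right after the current field.
    while ($line =~ /$tag\s([^,]+),(?=.*?$tag\s\g{1}(?:,|\z))/g) {
        push @found, $1;
    }
    return @found;
}

my $line = '%a astuff,%b bstuff,%t this,%u that,%v this,%t that,'
         . '%x the other,%xx only once,%q the other,%z the other,%c cstuff';

my @dups = repeated_contents($line);
print join('|', @dups), "\n";   # this|that|the other
```

Content that occurs three or more times in tagged fields would be reported once for each occurrence that has a later repeat, so deduplicate the result if that matters.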

Upvotes: 2

Håkon Hægland

Reputation: 40758

A global match used as the list of a for loop is evaluated in list context, so it returns all matches at once before the loop iterates over them. By the time the loop body runs, the match variables hold the values from the last successful match. A global match in a while condition is evaluated in scalar context, so it matches once per iteration and the match variables are correct each time.
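
The difference can be seen in isolation with a small sketch. The string and pattern here are made up for illustration and are not from the question:

```perl
use strict;
use warnings;

my $text = 'k1=a;k2=b;k3=c';   # illustrative data only

# List context: all matching is finished before the loop body first runs,
# so $& refers to the last successful match on every iteration.
my @seen_in_for;
for my $value ($text =~ /k\d=(\w)/g) {
    push @seen_in_for, "${value}:$&";
}

# Scalar context: each test of the condition performs one match, so $1
# and $& belong to the current iteration.
my @seen_in_while;
while ($text =~ /k\d=(\w)/g) {
    push @seen_in_while, "${1}:$&";
}

print "for:   @seen_in_for\n";     # for:   a:k3=c b:k3=c c:k3=c
print "while: @seen_in_while\n";   # while: a:k1=a b:k2=b c:k3=c
```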

You can get all three matches by resetting pos($line) on each iteration, so that the next search starts one character past the start of the current match instead of at its end. For example:

while ($line =~ m/
      $tagmrkr$searchtags\ ([^\,]*,)
      .*?
      $tagmrkr$searchtags\ \1
   /gx) {
    # $-[0] is the start offset of the current match; restart just past it
    pos $line = $-[0] + 1;
    print qq/First field contents:$1\n/;
    print qq/Entire match:$&\n/;
    print qq/\n/;
}

Output:

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:this,
Entire match:%t this,%u that,%v this,

First field contents:that,
Entire match:%u that,%v this,%t that,

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,
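
The effect of moving pos back is easier to see on a tiny input. The following sketch uses a made-up string, not the question's data, and finds overlapping occurrences of a pattern with the same technique:

```perl
use strict;
use warnings;

my $text = 'ababa';   # illustrative data: 'aba' occurs at offsets 0 and 2
my @starts;

while ($text =~ /aba/g) {
    push @starts, $-[0];       # start offset of the current match
    pos($text) = $-[0] + 1;    # resume one character after that start
}

print "@starts\n";   # 0 2
```

Without the pos assignment, the second search would begin at offset 3, after the first match, and the overlapping occurrence at offset 2 would be missed.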

Upvotes: 0
