mkHun
mkHun

Reputation: 5927

What is the search flow of pattern match?

My string is

$s = "AAATAATAGCAV";

pattern 1

$s =~m/AT?.?A/g; 

Here T? fails to search. .? match the string second character (AAA) finally A matches (AAA)

pattern 2

$s =~m/A.?T?A/g

.? match the second character. T? fails to search. Results the same

Here is my doubt

$s =~m/A.?T?AA/

From beginning A matches first character from string

.? matches any one character match or not match from a string. So it is match the second character as pattern 1 and pattern 2.

T? also match any one character match or not match from a string.

AA match the AA character from a string.

Why above pattern won't match AATAA or ATAA. How the search engine works.? Why it is result AAA

Upvotes: 0

Views: 119

Answers (3)

Tim Potapov
Tim Potapov

Reputation: 443

I added an embedding code inside the regex to see where the engine is at as its searching. The key here is an understanding of the "bump-along".

Once a match is made the next possible match can only begin after the previous match ended.

Run while printing the progress

my $s = "AAATAATAGCAV";
my $n = 1;

while($s =~  
    m/

      # Embedded code construct (?{ })
      (?{print "\n", $n++, ". Starting match at: [$`||$']: "})  

      # Same as AT?.?A
      A T? .? A                                

   /gx  

){
   print "Found: [$&]";
}

Output with symbol || to denote the current place:

1. Starting match at: [||AAATAATAGCAV]: Found: [AAA]
2. Starting match at: [AAAT||AATAGCAV]: Found: [AA]
3. Starting match at: [AAATAAT||AGCAV]: 
4. Starting match at: [AAATAATAGC||AV]: 

Expounding on the steps in more details:

  1. Starting match at: [||AAATAATAGCAV]: Found: [AAA]

    • A: Start at the beginning.
    • B: 'A' is found (pass).
    • C: 'T?' optional 'T' (so still a pass).
    • D: '.?' any optional character (matches 2nd 'A' in the string) (found so its still a pass)
    • E: 'A' is found (pass).

Here is the progressing for the steps above:

  • A: String: ||AAATAATAGCAV. Regex: ||AT?.?A
  • B: String: |A|AATAATAGCAV. Regex: |A|T?.?A
  • C: String: |A|AATAATAGCAV. Regex: |AT?|.?A
  • D: String: |AA|ATAATAGCAV. Regex: |AT?.?|A
  • E: String: |AAA|TAATAGCAV. Regex: |AT?.?A|

2 Keys

  1. Since we included the \g modifier, the regex wants to find other matches.
  2. Since we already found a match, the next possible match must begin after the last match ended. In this case after the regex finds 'AAA'. The next match can only begin here: AAAT**||**AATAGCAV

Upvotes: 2

Sobrique
Sobrique

Reputation: 53478

There's a nice easy way to see what a regex is doing:

 use re 'debug';

E.g.:

#!/usr/bin/env perl 
use strict;
use warnings;

use re 'debug';

my $str = 'AAATAATAGCAV';    
$str =~m/A.?T?AA/;

This will print:

Compiling REx "A.?T?AA"
Final program:
   1: EXACT <A> (3)
   3: CURLY {0,1} (6)
   5:   REG_ANY (0)
   6: CURLY {0,1} (10)
   8:   EXACT <T> (0)
  10: EXACT <AA> (12)
  12: END (0)
anchored "A" at 0 floating "AA" at 1..3 (checking floating) minlen 3 
Matching REx "A.?T?AA" against "AAATAATAGCAV"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [1..12] gave 1
  Found floating substr "AA" at offset 1 (rx_origin now 0)...
  doing 'other' fbm scan, [0..1] gave 0
  Found anchored substr "A" at offset 0 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
   0 <> <AAATAATAGC>         |  1:EXACT <A>(3)
   1 <A> <AATAATAGCA>        |  3:CURLY {0,1}(6)
                                  REG_ANY can match 1 times out of 1...
   2 <AA> <ATAATAGCAV>       |  6:  CURLY {0,1}(10)
                                    EXACT <T> can match 0 times out of 1...
   2 <AA> <ATAATAGCAV>       | 10:    EXACT <AA>(12)
                                      failed...
                                    failed...
   1 <A> <AATAATAGCA>        |  6:  CURLY {0,1}(10)
                                    EXACT <T> can match 0 times out of 1...
   1 <A> <AATAATAGCA>        | 10:    EXACT <AA>(12)
   3 <AAA> <TAATAGCAV>       | 12:    END(0)
Match successful!
Freeing REx: "A.?T?AA"

So to answer your question - this matched against the AAA at the start - because you have an optional (.?) which means zero is valid. And an optional T? which means none is valid there too.

It is therefore taking the first substring that matches your target pattern. By doing so it (by default!) consumes that piece of the pattern so it cannot match against it again. So to take your first example:

#!/usr/bin/env perl 
use strict;
use warnings;
use Data::Dumper;

use re 'debug';

my $s = 'AAATAATAGCAV';
my @matches = $s =~m/(AT?.?A)/g; 
print Dumper \@matches; 

Gives:

$VAR1 = [
          'AAA',
          'AA'
        ];

This puts the first match as AAA and the second as AA from your string: (AAA)T(AA)TAGCAV. Because the substring has matched, it's no longer 'available' to the pattern matching engine to use. You can do this if you really need to, but if you're going down that line I'd suggest thinking hard about what you're trying to do with your regex. But you can use "look around" matching.

Also: It's really bad form to use single char variable names generally. Call it something descriptive.

Upvotes: 2

svsd
svsd

Reputation: 1869

Regex: A.?T?AA
Text:  AAATAATAGCAV

A regular expression will start matching from the first character in the text and see if there's any way to match the pattern there. Also, ? is greedy, so it'll first try to match 1 character and if that fails, it'll match 0 characters. Let's see the how this translates to each step of the match:

Pattern: A.TAA
Text:    AAATAATAGCAV
           | Failed

The pattern failed while matching 'T' from 'T?', so now try to match 0 characters there.
Pattern: A.AA
Text:    AAATAATAGCAV
            | Failed

The pattern failed while matching the last 'A'. There's one more pattern now before moving on to the next character - the one where .? matches 0 characters.
Pattern: AAA
Text:    AAATAATAGCAV
         Success!

Upvotes: 1

Related Questions