Reputation: 5927
My string is
$s = "AAATAATAGCAV";
pattern 1
$s =~m/AT?.?A/g;
Here T?
fails to search. .?
match the string second character (AAA) finally A
matches (AAA)
pattern 2
$s =~m/A.?T?A/g
.?
match the second character. T?
fails to search. Results the same
Here is my doubt
$s =~m/A.?T?AA/
From beginning A
matches first character from string
.?
matches any one character match or not match from a string. So it is match the second character as pattern 1 and pattern 2.
T?
also match any one character match or not match from a string.
AA
match the AA
character from a string.
Why above pattern won't match AATAA
or ATAA
. How the search engine works.? Why it is result AAA
Upvotes: 0
Views: 119
Reputation: 443
I added an embedding code inside the regex to see where the engine is at as its searching. The key here is an understanding of the "bump-along".
Once a match is made the next possible match can only begin after the previous match ended.
my $s = "AAATAATAGCAV";
my $n = 1;
while($s =~
m/
# Embedded code construct (?{ })
(?{print "\n", $n++, ". Starting match at: [$`||$']: "})
# Same as AT?.?A
A T? .? A
/gx
){
print "Found: [$&]";
}
1. Starting match at: [||AAATAATAGCAV]: Found: [AAA]
2. Starting match at: [AAAT||AATAGCAV]: Found: [AA]
3. Starting match at: [AAATAAT||AGCAV]:
4. Starting match at: [AAATAATAGC||AV]:
Starting match at: [||AAATAATAGCAV]: Found: [AAA]
Here is the progressing for the steps above:
Upvotes: 2
Reputation: 53478
There's a nice easy way to see what a regex is doing:
use re 'debug';
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
use re 'debug';
my $str = 'AAATAATAGCAV';
$str =~m/A.?T?AA/;
This will print:
Compiling REx "A.?T?AA"
Final program:
1: EXACT <A> (3)
3: CURLY {0,1} (6)
5: REG_ANY (0)
6: CURLY {0,1} (10)
8: EXACT <T> (0)
10: EXACT <AA> (12)
12: END (0)
anchored "A" at 0 floating "AA" at 1..3 (checking floating) minlen 3
Matching REx "A.?T?AA" against "AAATAATAGCAV"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [1..12] gave 1
Found floating substr "AA" at offset 1 (rx_origin now 0)...
doing 'other' fbm scan, [0..1] gave 0
Found anchored substr "A" at offset 0 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <AAATAATAGC> | 1:EXACT <A>(3)
1 <A> <AATAATAGCA> | 3:CURLY {0,1}(6)
REG_ANY can match 1 times out of 1...
2 <AA> <ATAATAGCAV> | 6: CURLY {0,1}(10)
EXACT <T> can match 0 times out of 1...
2 <AA> <ATAATAGCAV> | 10: EXACT <AA>(12)
failed...
failed...
1 <A> <AATAATAGCA> | 6: CURLY {0,1}(10)
EXACT <T> can match 0 times out of 1...
1 <A> <AATAATAGCA> | 10: EXACT <AA>(12)
3 <AAA> <TAATAGCAV> | 12: END(0)
Match successful!
Freeing REx: "A.?T?AA"
So to answer your question - this matched against the AAA
at the start - because you have an optional (.?
) which means zero is valid. And an optional T?
which means none is valid there too.
It is therefore taking the first substring that matches your target pattern. By doing so it (by default!) consumes that piece of the pattern so it cannot match against it again. So to take your first example:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use re 'debug';
my $s = 'AAATAATAGCAV';
my @matches = $s =~m/(AT?.?A)/g;
print Dumper \@matches;
Gives:
$VAR1 = [
'AAA',
'AA'
];
This puts the first match as AAA
and the second as AA
from your string: (AAA)T(AA)TAGCAV
. Because the substring has matched, it's no longer 'available' to the pattern matching engine to use. You can do this if you really need to, but if you're going down that line I'd suggest thinking hard about what you're trying to do with your regex. But you can use "look around" matching.
Also: It's really bad form to use single char variable names generally. Call it something descriptive.
Upvotes: 2
Reputation: 1869
Regex: A.?T?AA
Text: AAATAATAGCAV
A regular expression will start matching from the first character in the text and see if there's any way to match the pattern there. Also, ?
is greedy, so it'll first try to match 1 character and if that fails, it'll match 0 characters. Let's see the how this translates to each step of the match:
Pattern: A.TAA
Text: AAATAATAGCAV
| Failed
The pattern failed while matching 'T' from 'T?', so now try to match 0 characters there.
Pattern: A.AA
Text: AAATAATAGCAV
| Failed
The pattern failed while matching the last 'A'. There's one more pattern now before moving on to the next character - the one where .? matches 0 characters.
Pattern: AAA
Text: AAATAATAGCAV
Success!
Upvotes: 1