mkHun
mkHun

Reputation: 5927

Why negative look ahead not consider in backtracking?

Suppose I have a string like this:

121456 word123word456word897 10:10

My condition is to check the 10:10 at the end of the string. So I will write the pattern as follows:

$s = "121456    word123word456word897   10:10";
if($s =~m/\d+(.+)(?=10:10)/)
{
    print "$1\n";  #
    print "Hello ";
}

Here .+ match up to end, then it will backtrack for to match 10:10. And $1 contains word123word456word897

But the problem is, I write the condition with negative look ahead, but the regex search engine can't backtrack.

if($s =~m/\d+(.+)(?!10:10)/)
{
    print "$1\n";
    print "Hello ";
}

Here $1 contains word123word456word897 10:10. And negative look ahead is not consider in backtracking. What is the problem of this negative look ahead?

Upvotes: 2

Views: 605

Answers (4)

ikegami
ikegami

Reputation: 386501

(?!10:10) is checked.

          1         2         3
0123456789012345678901234567890123456

121456 word123word456word897   10:10

\____/\____________________________/|
  \d+               .+              (?!10:10)

In words,

  • The whole pattern matches, because
    • \d+ matches 6 characters at position 0.
    • .+ matches 30 characters at position 6.
    • (?!10:10) matches 0 characters at position 36, because
      • 10:10 doesn't match at position 36.

What you actually want is

if (my $match = /( \d+ (?:(?!10:10).)* )/sx) {
   print("$match\n");
}

(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR.

Upvotes: 2

Michael Carman
Michael Carman

Reputation: 30851

Regular expressions try really hard to match. If there's any way to match your input with a particular pattern the regex engine will find it. Backtracking only occurs on a failure -- that is, when necessary in order to try an alternative in order to make the pattern match. Look-arounds can cause backtracking but negative look-arounds are tricky. They don't behave in the way many people intuitively expect.

Both of your sample cases start the same way:

  • The \d+ matches 121456
  • The .+ matches everything else

For your positive-lookahead case the (?=10:10) can't match when there's no more input so the engine backtracks until it can.

For your negative-lookahead case the (?!10:10) can match at the end of the input because the next thing is '' (i.e. nothing) and nothing is not 10:10. The engine doesn't backtrack because it doesn't need to. The pattern was matched successfully.

Upvotes: 1

rock321987
rock321987

Reputation: 11042

NOTE :- I will explain this of my own understanding. All corrections are welcomed

Why is this happening here?

By default, the aim of a regex engine is to meet all the required conditions to find a match in a string. This is achieved via backtracking, simple matching and jumping to different saved states (usually supported by NFA engines) if current state fails to satisfy the regex condition.

Once all the conditions are met, the requirement is fulfilled and the engine stops checking for any other thing. There is further no need of backtracking, matching or doing anything fancy because the requirements are already met.

Now coming back to your question, following is your string

121456    word123word456word897   10:10

In your first regex

\d+.+(?=10:10) 

i) \d+ matches all the digits <-- No Problem

ii) As .+ is greedy, it will match all the string and move to last <-- No problem

iii) To satisfy the next condition (?=10.10), there is no string left. So all the conditions are not fulfilled and hence to meet this condition, regex engine starts backtracking till it finds 10:10

In your second regex

\d+.+(?!10:10) 

i, ii) The first two steps are exact same as above

iii) To satisfy the next condition (?!10:10), whatever follows (here, we already have reached end of string or $ due to greediness of .+) should not match 10:10. It is obvious that end of string do not matches 10:10. Hence all of our condition is fulfilled. So there is no need of backtracking or doing anything at all because all our required conditions are met.

A picture is worth thousand words

For \d+.+(?=10:10)

enter image description here

For \d+.+(?!10:10)

enter image description here

Image credit :- https://regex101.com/

Upvotes: 3

JonM
JonM

Reputation: 1374

The .+ quantifier is greedy by default so the .+ part of the expression will match to the end of the string and assert that there is no 10:10 ahead therefore the expression will pass with no reason to need to backtrack.

The expression with the positive lookahead will also reach the end of the string. The difference is that the lookahead assertion will fail and the .+ part of the expression will back track until the assertion passes or there are no more branches of inquiry.

Upvotes: 2

Related Questions