RobM
RobM

Reputation: 8965

Why is this regex acting greedy, when I've told it not to be?

I've attempted to encode the following EBNF grammar for my system's macros into a regular expression (below), but despite my best efforts it appears to greedily match across multiple macros: it doesn't stop at the closing }}@.

    Expand ``@{{...}}@`` references which may appear in Step parameters.
    The syntax is described by the following EBNF grammar::

        depdata = "@{{", source identifier, ":", attribute, "}}@"
                | "@{{TAGS:", expression, "}}@" ;
        source identifier = ? printable 7-bit ASCII ? ;
        attribute = "DATADIR" | "TAGSFILE" | "RESULT_INT" ;
        expression = ? printable 7-bit ASCII ? ;

and the Python regular expression I've come up with

@{{(?:(?:(?P<id>.*?):(?P<attr>DATADIR|TAGSFILE|RESULT_INT))|TAGS:(?P<expr>.+?))}}@

Edit: Added missing + within expr group

Regular expression visualization

Debuggex Demo

When finding all the matches in the following test-case, I expect the result to be three matches, but I only get two:

@{{TAGS:sTagsJob << "job||ID||source"}}@ test  @{{job:DATADIR}}@ email body @{{job:DATADIR}}@ blah

The matches I expect are:

  1. @{{TAGS:sTagsJob << "job||ID||source"}}@ with the expr group set
  2. @{{job:DATADIR}}@ with id and attr groups set
  3. @{{job:DATADIR}}@ (again) with id and attr groups set

But instead, the matches are:

  1. @{{TAGS:sTagsJob << "job||ID||source"}}@ test @{{job:DATADIR}}@
  2. @{{job:DATADIR}}@

How come the non-greedy matches (.+?) seem to be acting greedy? What have I missed?

(And yes, I'm aware the EBNF grammar is silly and could be improved by always having the fixed strings appear on the right hand side. But that's not my question: I want to learn why my regex-fu has failed me)

Upvotes: 3

Views: 211

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626950

The dot matches any character (but a newline, if the DOTALL mode is off). The * and *? match 0+ characters, just the .* grabs everything at once, and *? does it step by step, checking the subsequent subpatterns for a match before retrying the *? subppatern. You confuse the meaning of "greedy" here: note that regex tries to find a match "by all means", it retries quantified subpatterns when a mismatch occurs at some location in the string, the engine backtracks with greedy quantifiers, it does everything to fetch a match. Lazy quantifiers are not a guarantee that your pattern won't overfire just because of an extra ? in the quantifier definition.

Thus, avoid using dot matching patterns if you do not really mean that, if the pattern is known. Here, the printable ASCII pattern is required - then use it, do not rely on dot matching.

@{{(?P<id>TAGS|[ -~]+?):(?:(?P<attr>DATADIR|TAGSFILE|RESULT_INT)|(?P<expr>[ -~]+?))}}@

See the regex demo

Note that [ -~] matches any printable ASCII character. See My favorite regex of all time.

The pattern matches:

  • @{{ - leading delimiter
  • (?P<id>TAGS|[ -~]+?) - an id group matching TAGS or 1+ printable ASCII chars, but as few as possible since it also matches : (you can restrict the character class with a lookahead to exclude : or replace with [ -9;-~] to make the pattern even more optimal and get rid of ? with this quantifier)
  • : - literal :
  • (?:(?P<attr>DATADIR|TAGSFILE|RESULT_INT)|(?P<expr>[ -~]+?)) - matches DATADIR, TAGSFILE, or RESULT_INT and places into attr group, or matches one or more printable ASCII (as few as possible) and places into group "expr". Again, it is lazy here because [ -~] matches a }. Otherwise, you could use a tempered greedy token here: (?:(?!}}@)[ -~])+. See demo
  • }}@ - trailing delimiter

Upvotes: 3

L3viathan
L3viathan

Reputation: 27283

Here's a regex that seems to do what you want:

@{{(?:(?:(?P<id>[^:]*?):(?P<attr>DATADIR|TAGSFILE|RESULT_INT))|TAGS:(?P<expr>.*?))}}@

This correctly matches:

MATCH 1
expr    [8-37]  `sTagsJob << "job||ID||source"`

MATCH 2
id  [50-53] `job`
attr    [54-61] `DATADIR`

MATCH 3
id  [79-82] `job`
attr    [83-90] `DATADIR`

There is one bug in your regex that prevents it from matching the TAG part (forgot a kleene star in P<expr>.?))}}@?), but that alone doesn't solve the issue with matching too much, so I changed the dot after <id> to be a "non-colon".

Upvotes: 0

Related Questions