Reputation: 8965
I've attempted to encode the following EBNF grammar for my system's macros into a regular expression (below), but despite my best efforts it appears to greedily match across multiple macros: it doesn't stop at the closing }}@
.
Expand ``@{{...}}@`` references which may appear in Step parameters.
The syntax is described by the following EBNF grammar::
depdata = "@{{", source identifier, ":", attribute, "}}@"
| "@{{TAGS:", expression, "}}@" ;
source identifier = ? printable 7-bit ASCII ? ;
attribute = "DATADIR" | "TAGSFILE" | "RESULT_INT" ;
expression = ? printable 7-bit ASCII ? ;
and the Python regular expression I've come up with
@{{(?:(?:(?P<id>.*?):(?P<attr>DATADIR|TAGSFILE|RESULT_INT))|TAGS:(?P<expr>.+?))}}@
Edit: Added missing +
within expr
group
When finding all the matches in the following test-case, I expect the result to be three matches, but I only get two:
@{{TAGS:sTagsJob << "job||ID||source"}}@ test @{{job:DATADIR}}@ email body @{{job:DATADIR}}@ blah
The matches I expect are:
@{{TAGS:sTagsJob << "job||ID||source"}}@
with the expr
group set@{{job:DATADIR}}@
with id
and attr
groups set@{{job:DATADIR}}@
(again) with id
and attr
groups setBut instead, the matches are:
@{{TAGS:sTagsJob << "job||ID||source"}}@ test @{{job:DATADIR}}@
@{{job:DATADIR}}@
How come the non-greedy matches (.+?
) seem to be acting greedy? What have I missed?
(And yes, I'm aware the EBNF grammar is silly and could be improved by always having the fixed strings appear on the right hand side. But that's not my question: I want to learn why my regex-fu has failed me)
Upvotes: 3
Views: 211
Reputation: 626950
The dot matches any character (but a newline, if the DOTALL mode is off). The *
and *?
match 0+ characters, just the .*
grabs everything at once, and *?
does it step by step, checking the subsequent subpatterns for a match before retrying the *?
subppatern. You confuse the meaning of "greedy" here: note that regex tries to find a match "by all means", it retries quantified subpatterns when a mismatch occurs at some location in the string, the engine backtracks with greedy quantifiers, it does everything to fetch a match. Lazy quantifiers are not a guarantee that your pattern won't overfire just because of an extra ?
in the quantifier definition.
Thus, avoid using dot matching patterns if you do not really mean that, if the pattern is known. Here, the printable ASCII pattern is required - then use it, do not rely on dot matching.
@{{(?P<id>TAGS|[ -~]+?):(?:(?P<attr>DATADIR|TAGSFILE|RESULT_INT)|(?P<expr>[ -~]+?))}}@
See the regex demo
Note that [ -~]
matches any printable ASCII character. See My favorite regex of all time.
The pattern matches:
@{{
- leading delimiter(?P<id>TAGS|[ -~]+?)
- an id
group matching TAGS
or 1+ printable ASCII chars, but as few as possible since it also matches :
(you can restrict the character class with a lookahead to exclude :
or replace with [ -9;-~]
to make the pattern even more optimal and get rid of ?
with this quantifier):
- literal :
(?:(?P<attr>DATADIR|TAGSFILE|RESULT_INT)|(?P<expr>[ -~]+?))
- matches DATADIR
, TAGSFILE
, or RESULT_INT
and places into attr
group, or matches one or more printable ASCII (as few as possible) and places into group "expr". Again, it is lazy here because [ -~]
matches a }
. Otherwise, you could use a tempered greedy token here: (?:(?!}}@)[ -~])+
. See demo}}@
- trailing delimiterUpvotes: 3
Reputation: 27283
Here's a regex that seems to do what you want:
@{{(?:(?:(?P<id>[^:]*?):(?P<attr>DATADIR|TAGSFILE|RESULT_INT))|TAGS:(?P<expr>.*?))}}@
MATCH 1
expr [8-37] `sTagsJob << "job||ID||source"`
MATCH 2
id [50-53] `job`
attr [54-61] `DATADIR`
MATCH 3
id [79-82] `job`
attr [83-90] `DATADIR`
There is one bug in your regex that prevents it from matching the TAG
part (forgot a kleene star in P<expr>.?))}}@
?), but that alone doesn't solve the issue with matching too much, so I changed the dot after <id>
to be a "non-colon".
Upvotes: 0