DimChtz

Reputation: 4333

Multi-line match inside literals in Flex

I am trying to match text inside %[ and ]% in single or multiple lines. First thing I tried was:

\%\[(.*?)\]\%              return MULTILINE_TEXT;

but this works only for single line cases, not for multiple lines. So, I thought I could use /s:

/\%\[(.*?)\]\%/s           return MULTILINE_TEXT;

But flex sees this as an invalid rule. The last thing I tried was:

\%\[((.*?|\n)*?)\]\%       return MULTILINE_TEXT;

which seemed to work, but it doesn't stop at the first ]%. In the following example:

%[ Some text ...
   Some text ... ]%

... other stuff ...

%[ Some more text ...
   Some more text ... ]%

flex will return the entire thing as a single token. What can I do?

Upvotes: 0

Views: 525

Answers (1)

rici

Reputation: 241771

Note that *? is not treated as a non-greedy match by flex.

Flex does support some regex flags, but its syntax differs a little from that of most regex libraries. For example, you can change the meaning of . by setting the s flag; the change applies only to the region within the parentheses (not to everything following the flag setting, as in PCRE):

"%["(?s:.*)"%]"

It's more common to see the lex-compatible usage:

"%["(.|\n)*"%]"

You can also use the x flag for slightly more readable regexes:

(?xs: "%[" .* "%]" )

(The x flag does not work in definitions, only in pattern rules.)

Quoted strings (as above) are another (f)lex-specific syntax, which can be more readable than backslash escapes, although backslash escapes also work. But flex does not implement PCRE/GNU/JS extensions such as \w and \s.

See the flex manual for a complete guide to flex regexes; it's definitely worth reading if you are used to other regex syntaxes.

You will probably find it disappointing that (f)lex does not support many common regex extensions, including non-greedy matches. That makes it awkward to write patterns for tokens terminated by multiple characters, as in your example. If the delimiters %[ and %] cannot be nested, so that you really want the match to end with the first %], you could use something like this:

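%\[([^%]|%+[^]%])*%+\]   or  (?x: "%[" ( [^%] | %+ [^]%] )* %* "%]" )

That's a bit hard to read, but it is precise: %[ followed by any number of repetitions of either a character other than % or a sequence of % followed by something other than ] or %, ending with a sequence of % followed by a ]. (The character after the run of % must exclude % as well as ]; otherwise %%] could be consumed as a % followed by the "non-]" character %, and the match would run on to a later %].)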

In the above pattern, you need %+ rather than % to deal with strings like:

%[%% text surrounded by percents%%%]
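To try the non-nested pattern out, here is a minimal sketch of a complete scanner. The token code MULTILINE_TEXT = 258 and the main function are assumptions for illustration only (a real project would take the token from its parser header), and the rule uses the form in which the character after a run of % may be neither ] nor %:

```lex
%{
#include <stdio.h>
/* Hypothetical token code, for illustration only. */
enum { MULTILINE_TEXT = 258 };
%}

%option noyywrap nodefault noinput nounput

%%

"%["([^%]|%+[^]%])*%+"]"   { return MULTILINE_TEXT; }
.|\n                       { /* skip everything else in this sketch */ }

%%

int main(void) {
    int tok;
    /* yylex() returns 0 at end of input. */
    while ((tok = yylex()) != 0) {
        if (tok == MULTILINE_TEXT)
            printf("MULTILINE_TEXT (%d bytes): %s\n", (int)yyleng, yytext);
    }
    return 0;
}
```

Built with something like flex scanner.l && cc lex.yy.c -o scanner, this should report each %[ ... %] block as its own token, including blocks that span several lines, since [^%] also matches a newline in flex.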

A more readable solution which also allows for nested %[ is to use start conditions. There's a complete example of a very similar solution in this answer.
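As a rough idea of what the start-condition approach can look like, here is a sketch. It is not the code from the linked answer; the names BLOCK, depth and MULTILINE_TEXT = 258 are made up for illustration, and using yymore() to accumulate the whole block in yytext is just one possible choice:

```lex
%{
#include <stdio.h>
/* Hypothetical token code, for illustration only. */
enum { MULTILINE_TEXT = 258 };
static int depth = 0;   /* current nesting level of %[ ... %] */
%}

%option noyywrap nodefault noinput nounput
%x BLOCK

%%

"%["            { depth = 1; BEGIN(BLOCK); yymore(); }
<BLOCK>{
    "%["        { ++depth; yymore(); }
    "%]"        {
                  if (--depth == 0) {
                      /* Outermost block closed: yytext holds the whole block. */
                      BEGIN(INITIAL);
                      return MULTILINE_TEXT;
                  }
                  yymore();
                }
    [^%]+       { yymore(); }   /* ordinary text, including newlines */
    "%"         { yymore(); }   /* a lone % that starts no delimiter */
    <<EOF>>     { fputs("unterminated block\n", stderr); return 0; }
}
.|\n            { /* skip everything outside blocks in this sketch */ }

%%

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)
        printf("MULTILINE_TEXT: %s\n", yytext);
    return 0;
}
```

Each rule is simple on its own, and the nesting logic lives in ordinary C rather than in the regex. If nesting is not wanted, dropping the inner "%[" rule and the counter leaves a scanner that stops at the first %].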

Upvotes: 6
