Strange lexing issue keywords vs identifiers regex matching

Question

I've been struggling to understand some behavior of flex.

I started defining a small toy-like example program which will tokenize into keywords and strings.

One definition of the regex performs as expected but another behaves quite differently, contrary to my expectation.

It has been a few years since I've played with this stuff so hopefully someone can point me in the right direction.

I modified the token regular expression to get it to work but I'd really like to understand why my original choice behaved differently.

This first example is the non-working code

%{
#include 
using namespace std;
%}
%option noyywrap
%%
[ 	
]                   {cout << "ws" << endl;};
buzz                      {cout << "kw" << endl;};
[^
]+                    {cout << "str" << endl;};
%%
int main(){
yylex();
}

The second example is the modified version which does behave properly.

%{
#include 
using namespace std;
%}
%option noyywrap
%%
[ 	
]                   {cout << "ws" << endl;};
buzz                      {cout << "kw" << endl;};
[a-zA-Z]+                 {cout << "str" << endl;};
%%
int main(){
yylex();
}

In the code, buzz is supposed to be a keyword, and anything following should be just read as a string.

For the first example, buzz gets consumed up along with the remaining word as a "str".

In the second example, buzz is properly recognized and the remaining word becomes the "str".

I understand that the third rule in both cases is also a valid definition for a token containing the characters b-u-z-z. Each of these four letters is in [^ ]+, as well as [a-zA-Z]+. So why on earth is the behavior different?

example inputs would be:

buzz lightyear
buzz aldren

Thanks!

sepp2k · Accepted Answer

Flex (as well as most other lexer generators) works according to the maximum munch rule. That rule says that if multiple patterns can match on the current input, the one that produces the longest match is chosen. If multiple pattern produce a match of the same size, the one that appears first in the .l file is chosen.

So in your working solution the patterns buzz and [a-zA-Z0-9]+ both match buzz, so buzz is chosen because it appears first in the file (if you switched the two lines, str would be printed instead). In your non-working solution buzz still would only match buzz, but [^ ]+ matches buzz lightyear and buzz aldren respectively, which is the longer match. Thus it wins according to the maximum munch rule.

Strange lexing issue keywords vs identifiers regex matching

Answers (1)

Related Questions