Jamie

Reputation: 7451

Why is the flex regex being skipped?

I can't, for the life of me, figure out what's wrong with my regexes.

I'd like to tokenize two types of strings, each contained on a single line. One can contain anything other than a newline; the other can contain only ASCII alphanumeric characters and the literal characters '_', '/', '-', and '.'.

The snippet of flex code is:

nl    \n|\r\n|\r|\f|\n\r
...
%%
...
\"[^\"]+{nl} { frx_parser_error("Label is missing trailing double quote."); }
\"[a-zA-Z0-9_\.\/\-]+\" {
      if (yyleng > 1024) frx_parser_error("File name too long.");
      yytext[yyleng - 1] = '\0';
      frx_parser_lval.str = strdup(yytext+1);
      fprintf(stderr,"TOSP_FILENAME: %s\n", frx_parser_lval.str);
      return (TOSP_FILENAME);
   }
\"[^{nl}]+\" {
      yytext[yyleng - 1] = '\0';
      frx_parser_lval.str = strdup(yytext+1);
      fprintf(stderr,"TOSP_IDENTIFIER:\n%s\n",  frx_parser_lval.str);
      return (TOSP_IDENTIFIER);
   }

And when I run the parser, the fprintf's spit this out:

TOSP_FILENAME: ModStar-Picture-Analysis.txt
TOSP_FILENAME: ModStar-Rubric.log.txt
TOSP_IDENTIFIER:
picture-A"
Progress (26,255)   camera 'C' root("picture-C-
Syntax (line 34): syntax error

For whatever reason, the quote after picture-A is being missed. Why? I checked the ASCII values at the eight locations where the double quote character appears, and they're all 0x22.

If I add some characters to the end of "picture-A", it sometimes works: adding ".par" or ".pbr" doesn't work as expected, but ".pnr" does.

I've even added a specific non-regexy token:

\"picture-A\"    { frx_parser_lval.str = strdup("picture-A"); return TOSP_FILENAME; }

to the lex file and it gets skipped.

I'm using flex 2.5.39, no flex libraries, one option (%option prefix=frx_parser_) in the lex file and the flex command line is:

flex -t script-lexer.l  > script-lexer.c

What gives?

EDIT I need to test this on the actual system, but unit tests show this tokenizer to be much more robust (based on rici's answer):

nl      \n|\r\n|\r|\f|\n\r
...
%%
...
["][^"]+{nl}           { printf("Missing trailing quote.\n%s\n",yytext); }
["][[:alnum:]_./-]+["] { printf("File name:\n%s\n",yytext); }
["][^"]+["]            { printf("String:\n%s\n",yytext); } 

EDIT The rule ["].+["] swallowed multiple consecutive strings as one big string, so it was changed to ["][^"]+["].
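
To illustrate why ["].+["] merged neighbouring strings, here is a minimal sketch in plain C that imitates flex's longest-match behaviour for the two rules by hand. It is not flex itself: the helper names match_dot_plus and match_not_quote and the sample input are made up for the illustration, and the only flex behaviour assumed is that '.' matches any character except '\n' and that the longest match wins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the longest prefix of s matching
 *     ["].+["]
 * '.' accepts everything except '\n', quotes included, so the longest
 * match ends at the LAST quote on the line. Returns 0 on no match. */
size_t match_dot_plus(const char *s)
{
    size_t i, best = 0;

    assert(s != NULL);
    if (s[0] != '"')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++) {
        if (s[i] == '"' && i >= 2)
            best = i + 1;   /* longest match so far ends at this quote */
    }
    return best;
}

/* Hypothetical helper: length of the prefix of s matching
 *     ["][^"]+["]
 * The class stops at the first quote, so only one string is consumed.
 * Returns 0 on no match. */
size_t match_not_quote(const char *s)
{
    size_t i = 1;

    assert(s != NULL);
    if (s[0] != '"')
        return 0;
    while (s[i] != '\0' && s[i] != '"')
        i++;
    return (i > 1 && s[i] == '"') ? i + 1 : 0;
}
```

On a made-up line holding two strings, "a.txt" "b.txt", match_dot_plus covers both strings and the space between them, while match_not_quote stops after the first closing quote.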

Upvotes: 0

Views: 110

Answers (1)

rici

Reputation: 241931

The problem is your pattern:

\"[^{nl}]+\" 

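You're attempting to expand a definition inside a character class, but that is not possible; inside a character class, { is always just a {, not a flex operator. So [^{nl}] matches any character other than {, n, l, or }, including a double quote or a newline, and since flex takes the longest match, the pattern runs past the closing quote of "picture-A". See the flex manual: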

Note that inside of a character class, all regular expression operators lose their special meaning except escape (‘\’) and the character class operators, ‘-’, ‘]]’, and, at the beginning of the class, ‘^’.

A definition is not a macro. Rather, a definition defines a new regular expression operator.
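
To make the symptoms in the question concrete, here is a minimal sketch in plain C that imitates, by hand, how flex reads the two competing patterns. It is an illustration rather than flex's actual matcher: the helper names match_brace_class and match_filename are hypothetical, and the sample input is loosely reconstructed from the output shown in the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the longest prefix of s matching
 *     \"[^{nl}]+\"
 * read the way flex reads it: a quote, then one or more characters that
 * are none of '{', 'n', 'l', '}', then a closing quote. 0 on no match. */
size_t match_brace_class(const char *s)
{
    size_t i, best = 0;

    assert(s != NULL);
    if (s[0] != '"')
        return 0;
    /* Quotes and newlines are accepted by the class too, so keep going
     * and remember the last quote seen before a '{', 'n', 'l' or '}'. */
    for (i = 1; s[i] != '\0' && strchr("{nl}", s[i]) == NULL; i++) {
        if (s[i] == '"' && i >= 2)
            best = i + 1;   /* longest match so far ends at this quote */
    }
    return best;
}

/* Hypothetical helper: length of the prefix of s matching
 *     \"[a-zA-Z0-9_./-]+\"
 * or 0 on no match. */
size_t match_filename(const char *s)
{
    size_t i = 1;

    assert(s != NULL);
    if (s[0] != '"')
        return 0;
    while (isalnum((unsigned char)s[i]) ||
           (s[i] != '\0' && strchr("_./-", s[i]) != NULL))
        i++;
    return (i > 1 && s[i] == '"') ? i + 1 : 0;
}
```

For input starting with "picture-A" followed by a newline and a line such as Progress (26,255)   camera 'C' root("picture-C"), match_filename stops at the first closing quote, but match_brace_class keeps going to the last quote because nothing in between is one of {, n, l, }. Flex prefers the longer match, so the identifier rule wins. With "picture-A.pnr", the n ends the run before any closing quote is reached, match_brace_class returns 0, and the file name rule is the only one that matches, which is consistent with ".pnr" working while ".par" and ".pbr" did not.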

As a consequence of the above, you can write [^\"] as simply [^"], and \"[a-zA-Z0-9_\.\/\-]+\" as \"[a-zA-Z0-9_./-]+\". (An unescaped - needs to be either at the end or at the beginning of the class.) Personally, I'd write the second pattern as:

["][[:alnum:]_./-]+["]

But everyone has their own style.

Upvotes: 3
