Robert Martin

Reputation: 17157

Using lex to tokenize without failing

I'm interested in using lex to tokenize my input string, but I do not want it to be possible to "fail". Instead, I want to have some type of DEFAULT or TEXT token, which would contain all the non-matching characters between recognized tokens.

Anyone have experience with something like this?

Upvotes: 1

Views: 115

Answers (2)

user207421

Reputation: 310850

To expand on @Chris Dodd's answer, the final rule in any lex script should be:

. return yytext[0];

and don't write any single-character rules like "+" return PLUS;. Just use the special characters you recognize directly in the grammar, e.g. term: term '+' factor;.

This practice:

  • saves you a lot of lex rules
  • makes your grammar much more readable
  • returns illegal characters as tokens to the parser, where you can do anything you like with them, or nothing, in which case you get the benefit of yacc's error recovery.
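Put together, a minimal sketch of this setup might look like the following (NUMBER is an assumed example token, not something from the answer; the grammar line reuses the term/factor rule mentioned above):

```
%%
[0-9]+    { return NUMBER; }        /* a real multi-character token (assumed example) */
[ \t\n]+  ;                         /* skip whitespace */
.         { return yytext[0]; }     /* final catch-all: hand the raw character to the parser */
%%
```

On the yacc side, the single characters then appear directly in the grammar, e.g. term: term '+' factor; — no PLUS token needed, and any character the grammar doesn't expect triggers yacc's normal error recovery.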

Upvotes: 1

Chris Dodd

Reputation: 126175

Use the pattern . at the end of all your lex rules to match any character that isn't matched by any other rule. You may also need a \n rule to match newlines, since a newline is the only character that . doesn't match.

If you want to combine adjacent non-matching characters into a single token, that is harder to do in lex and is more easily done in the parser.
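One way to do that combining on the parser side, as a rough sketch (TEXT_CHAR, append_char, and make_text are hypothetical names introduced here for illustration, not part of lex or yacc):

```
/* lex side: hand each otherwise-unmatched character to the parser */
.   { yylval.c = yytext[0]; return TEXT_CHAR; }

/* yacc side: fold a run of adjacent TEXT_CHARs into one text node */
text : text TEXT_CHAR   { $$ = append_char($1, $2); }
     | TEXT_CHAR        { $$ = make_text($1); }
     ;
```

Because the rule is left-recursive, each additional unmatched character extends the same text node, so a run of non-matching input ends up as a single TEXT value in the parse tree.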

Upvotes: 1
