Robert Martin

Reputation: 17157

Using lex to tokenize without failing

I'm interested in using lex to tokenize my input string, but I do not want it to be possible to "fail". Instead, I want to have some type of DEFAULT or TEXT token, which would contain all the non-matching characters between recognized tokens.

Anyone have experience with something like this?

Upvotes: 1

Views: 115

Answers (2)

user207421

Reputation: 310850

To expand on @Chris Dodd's answer, the final rule in any lex script should be:

. return yytext[0];

and don't write any single-character rules like "+" return PLUS;. Just use the special characters you recognize directly in the grammar, e.g. term: term '+' factor;.

This practice:

  • saves you a lot of lex rules
  • makes your grammar much more readable
  • returns illegal characters as tokens to the parser, where you can do anything you like with them, or nothing, in which case you get the benefit of yacc's error recovery.
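Put together, a minimal sketch of this setup might look like the following (NUMBER is an assumed example token, not something from the answer; the grammar line reuses the term/factor rule mentioned above):

```
%%
[0-9]+    { return NUMBER; }        /* a real multi-character token (assumed example) */
[ \t\n]+  ;                         /* skip whitespace */
.         { return yytext[0]; }     /* final catch-all: hand the raw character to the parser */
%%
```

On the yacc side, the single characters then appear directly in the grammar, e.g. term: term '+' factor; — no PLUS token needed, and any character the grammar doesn't expect triggers yacc's normal error recovery.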

Upvotes: 1

Chris Dodd

Reputation: 126175

Use the pattern . at the end of all your lex rules to match any character that isn't matched by any other rule. You may also need a \n rule to match newlines, since a newline is the only character that . doesn't match.

If you want to combine adjacent non-matching characters into a single token, that is harder to do in lex and is more easily done in the parser.
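One way to do that combining on the parser side, as a rough sketch (TEXT_CHAR, append_char, and make_text are hypothetical names introduced here for illustration, not part of lex or yacc):

```
/* lex side: hand each otherwise-unmatched character to the parser */
.   { yylval.c = yytext[0]; return TEXT_CHAR; }

/* yacc side: fold a run of adjacent TEXT_CHARs into one text node */
text : text TEXT_CHAR   { $$ = append_char($1, $2); }
     | TEXT_CHAR        { $$ = make_text($1); }
     ;
```

Because the rule is left-recursive, each additional unmatched character extends the same text node, so a run of non-matching input ends up as a single TEXT value in the parse tree.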

Upvotes: 1
