BangularTK
BangularTK

Reputation: 554

Writing Lex Specification for Identifiers

I am trying to write the lex specification to recognize Identifiers (Ids) in a language with restrictions: (1) The Id can be a Letter followed by a Letter/Digit/_ (maxlength 5) (2) The Id cannot be any of the keywords: "if" "else" "for" "while".... etc

Number (1) I did with no issues. My problem is trying to figure out the 2nd one. The keywords are not tokens for now- so the exclusion has to be implemented within the regular expression for the Id.

The regular expression I have so far is: {L}(({L}|{D}|_){1,4}) where L and D is a Letter and Digit respectively. How can I go about adding the exclusion of keywords here?

Upvotes: 0

Views: 469

Answers (1)

sepp2k
sepp2k

Reputation: 370132

It's possible to write a regex that meets restriction 2, but that's quite complicated and error-prone and scales badly with the number of keywords (that is every time you add a new keyword, you'd have to painstakingly expand the regex to be even more complicated).

Instead this problem is usually solved by defining the regex for identifiers without taking keywords into account and taking advantage of lex's rules for resolving ambiguities, which are as such:

  1. If there are multiple regexes that could match on the current input, the one with the longest match is chosen (this is known as the maximum munch rule). For example for the input "ifay" the lexer will produce IDENT("ifay") instead of KEYWORD_IF (followed by IDENT("ay")) because "ifay" is the longer match.

  2. If multiple regexes produce matches of the same length, lex will pick the one whose definition comes first in the lex file. So "if" will be tokenized as a keyword rather than an identifier as long as the definition for KEYWORD_IF comes before the definition of IDENTIFIER in the lex file.

So as long as you put your keyword definitions before your identifier definition, the lexer will work as you want without you needing to adjust the identifier rule.

Upvotes: 0

Related Questions