Reputation: 17040
Say I am writing a lexer for scanning robots.txt.
Say I want to start with the old RFC (http://www.robotstxt.org/norobots-rfc.txt) and only want to do the most basic syntax:
# COMMENT
User-agent: Google
Disallow: /admin
Disallow: /foo.html
Allow: /bar/bar.html
The very basic rules would look like this in Python:
_RULES = [
    ("WHITESPACE", r"\s+"),
    ("COLON", r":"),
    ("COMMENT", r"#.+"),
    ("UA_ID", r"user-agent"),
    ("ALLOW_ID", r"allow"),
    ("DISALLOW_ID", r"disallow"),
    ("RPATH", r"(/[a-zA-Z0-9:@&=.]*)+"),   # path fragments such as /foo.html
    ("UA_VALUE", r".+"),                   # anything else up to end of line
]
Note that if we put UA_VALUE before RPATH, its greediness will cause /admin to be matched as UA_VALUE, which is not what we really want later in the parsing stage.
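For illustration (not part of the original post), here is a minimal first-match-wins tokenizer over the _RULES above; the tokenize name, the case-insensitive matching, and the error handling are my own assumptions. Swapping RPATH and UA_VALUE changes the output exactly as described.

import re

# Minimal sketch: try each rule at the current position, first match wins.
def tokenize(line, rules):
    tokens = []
    pos = 0
    while pos < len(line):
        for kind, pattern in rules:
            m = re.compile(pattern, re.IGNORECASE).match(line, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
    return tokens

print(tokenize("Disallow: /admin", _RULES))
# [('DISALLOW_ID', 'Disallow'), ('COLON', ':'), ('WHITESPACE', ' '), ('RPATH', '/admin')]
# With UA_VALUE listed before RPATH, the last token becomes ('UA_VALUE', '/admin') instead.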
Now suppose I want my lexer to take in a set of "extension rules" (many robots.txt files now have "custom" directives, which the old RFC foresaw as "extensions"). An extension might be to support a full URL instead of a path fragment like RPATH, or to support Crawl-delay, which takes numbers or fractions. Now the user has to insert the new extension rules into _RULES in the right order to avoid matching against the wrong rule and producing the wrong kind of token.
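To make that concrete, here is what such an extension set might look like and where it would have to be spliced in; the token names, patterns, and the splice are hypothetical, not from the post.

# Hypothetical extension rules; their placement, not the exact patterns, is the issue.
_EXTENSION_RULES = [
    ("CRAWL_DELAY_ID", r"crawl-delay"),
    ("NUMBER", r"\d+(\.\d+)?"),          # e.g. 10 or 0.5
    ("FULL_URL", r"https?://[^\s#]+"),   # full URLs instead of RPATH fragments
]

# Both NUMBER and FULL_URL must sit before UA_VALUE (r".+"),
# otherwise UA_VALUE swallows their input first.
_RULES[-1:-1] = _EXTENSION_RULES   # splice in just before UA_VALUE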
I can see the problem with using regexes incompetently, but I believe this order-sensitivity issue can also exist in the hand-written, char-by-char, switch-statement based approach. In general, can we avoid this ordering problem? Or do I attempt to "fix" the wrong token type later, in semantic analysis, by looking at a token's neighbors? That might work for a trivial language like robots.txt. Do I need to break some of my rules down into smaller units? For example, instead of one regex for the entire http/https URL, I would make http:// and https:// two separate token types.
Upvotes: 0
Views: 53
Reputation: 241731
You could just include the : in the field-name tokens. Also, the "standard" insists that the declaration start in the first column, so you could also add an anchor operator (^). If that were not the case, you could dress the regex up a bit:
"^[[:blank:]]*Allow[[:blank:]]*:"
Beyond that, robots.txt, like many internet standards, does not yield very well to non-semantic lexical analysis. Individual directives have their own lexical rules. Unextended robots.txt is an easy case, since the only things allowed are (partial) URLs in the case of Allow and Disallow, and arbitrary strings not containing a # in the case of User-Agent. But extensions might have their own formats.
My usual approach to parsing field: value protocols is to extract the field (using a simple regex like the one above) and then look the syntax up in a table which maps field-names to value-parsers. There are a variety of variants on this theme, but that's the basic idea.
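A minimal sketch of that table-driven idea, with hypothetical parse_* helpers standing in for real value-parsers:

import re

FIELD_RE = re.compile(r"^([A-Za-z-]+)[ \t]*:[ \t]*(.*)$")

# Each value-parser knows the lexical rules of its own directive.
def parse_path(value):  return ("PATH", value)
def parse_agent(value): return ("AGENT", value.split("#", 1)[0].strip())
def parse_delay(value): return ("DELAY", float(value))

VALUE_PARSERS = {
    "allow":       parse_path,
    "disallow":    parse_path,
    "user-agent":  parse_agent,
    "crawl-delay": parse_delay,   # an extension just adds a table entry
}

def parse_line(line):
    line = line.strip()
    if not line or line.startswith("#"):
        return None                    # blank line or comment
    m = FIELD_RE.match(line)
    if not m:
        return None                    # or report an error, depending on strictness
    field, value = m.group(1).lower(), m.group(2)
    parser = VALUE_PARSERS.get(field)
    return (field, parser(value)) if parser else (field, ("UNKNOWN", value))

print(parse_line("Disallow: /admin"))    # ('disallow', ('PATH', '/admin'))
print(parse_line("Crawl-delay: 1.5"))    # ('crawl-delay', ('DELAY', 1.5))

The ordering problem disappears because each field name selects exactly one value-parser; there is no global rule list to keep sorted.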
Upvotes: 1