CppLearner

Reputation: 17040

Rules passed to the lexer are order-sensitive

Say I am writing a lexer for scanning robots.txt.

Say I want to start with the old RFC (http://www.robotstxt.org/norobots-rfc.txt) and only want to support the most basic syntax:

# COMMENT
User-agent: Google
Disallow: /admin
Disallow: /foo.html
Allow: /bar/bar.html

The most basic rules would look like this in Python:

_RULES = [
    (WHITESPACE, r"\s+"),
    (COLON, r":"),
    (COMMENT, r"#.+"),
    (UA_ID, r"user-agent"),
    (ALLOW_ID, r"allow"),
    (DISALLOW_ID, r"disallow"),
    (RPATH, r"(\/[a-zA-Z0-9|\:|\@|\&|\=\.]*)+"),
    (UA_VALUE, r".+"),
]
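For concreteness, here is a minimal first-match-wins tokenizer driven by that list. It is only a sketch of what I have in mind: it assumes the token kinds are plain strings, matches case-insensitively, and repeats _RULES so the snippet is self-contained.

import re

_RULES = [
    ("WHITESPACE",  r"\s+"),
    ("COLON",       r":"),
    ("COMMENT",     r"#.+"),
    ("UA_ID",       r"user-agent"),
    ("ALLOW_ID",    r"allow"),
    ("DISALLOW_ID", r"disallow"),
    ("RPATH",       r"(/[a-zA-Z0-9:@&=.]*)+"),
    ("UA_VALUE",    r".+"),
]

def tokenize(text, rules=_RULES):
    """Yield (kind, lexeme) pairs; the first rule that matches wins."""
    pos = 0
    while pos < len(text):
        for kind, pattern in rules:
            m = re.compile(pattern, re.IGNORECASE).match(text, pos)
            if m:
                if kind != "WHITESPACE":        # drop whitespace tokens
                    yield kind, m.group()
                pos = m.end()
                break
        else:
            raise SyntaxError(f"no rule matches at offset {pos}")

print(list(tokenize("Disallow: /admin\n")))
# [('DISALLOW_ID', 'Disallow'), ('COLON', ':'), ('RPATH', '/admin')]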

Note that if we put UA_VALUE before RPATH, its greedy .+ would match /admin as a UA_VALUE token, which is not what we want later in the parsing stage.

Now suppose I want my lexer to take in a set of "extension rules" (many robots.txt files now carry "custom" directives, which the old RFC foresaw as "extensions").

An extension might be to support a full URL instead of a path fragment like RPATH, or to support Crawl-delay, which takes numbers or fractions. Now the user has to insert the new extension rules into _RULES at the right position to avoid matching against the wrong rule and producing the wrong kind of token, as sketched below.
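For example, a hypothetical Crawl-delay extension would have to be spliced in ahead of the catch-all UA_VALUE rule (rule names as in the sketch above; the directive and the NUMBER pattern are made up for illustration):

# Hypothetical Crawl-delay extension: values are integers or fractions.
# Both new rules must sit before UA_VALUE, or its ".+" swallows the value.
_EXTENSION_RULES = [
    ("CRAWLDELAY_ID", r"crawl-delay"),
    ("NUMBER",        r"\d+(\.\d+)?"),
]

_insert_at = next(i for i, (kind, _) in enumerate(_RULES) if kind == "UA_VALUE")
_EXTENDED = _RULES[:_insert_at] + _EXTENSION_RULES + _RULES[_insert_at:]

print(list(tokenize("Crawl-delay: 2.5\n", rules=_EXTENDED)))
# [('CRAWLDELAY_ID', 'Crawl-delay'), ('COLON', ':'), ('NUMBER', '2.5')]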

I can see the problem with using regexes incompetently, but I believe this order-sensitivity issue can also exist in a hand-written, char-by-char, switch-statement-based approach.

In general, can we avoid this ordering problem? Or do I attempt to "fix" the wrong token type later, in semantic analysis, by looking at its neighbors? That might work for a trivial language like robots.txt. Or do I need to break some of my rules down into smaller units? For example, instead of one regex for an entire http/https://url... I would make http:// and https:// two separate token types.

Upvotes: 0

Views: 53

Answers (1)

rici

Reputation: 241731

You could just include the : in the field-name tokens. Also, the "standard" insists that the declaration start in the first column, so you could also add an anchor operator (^). If that were not the case, you could dress the regex up a bit:

"^[[:blank:]]*Allow[[:blank:]]*:"

Beyond that, robots.txt, like many internet standards, does not yield very well to non-semantic lexical analysis. Individual directives have their own lexical rules. Unextended robots.txt is an easy case, since the only things that are allowed are (partial) URLs in the case of Allow and Disallow and arbitrary strings not containing a # in the case of User-Agent. But extensions might have their own formats.

My usual approach to parsing field: value protocols is to extract the field (using a simple regex like the one above), and then look the syntax up in a table which maps field-names to value-parsers. There are a variety of variants on this theme, but that's the basic idea.
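A minimal sketch of that table-driven idea in Python (the field names, value parsers, and the [ \t] classes standing in for POSIX [[:blank:]] are illustrative choices, not part of the answer):

import re

# Extract "Field:" at the start of the line; the colon is part of the match.
FIELD_RE = re.compile(r"^[ \t]*([A-Za-z-]+)[ \t]*:[ \t]*(.*)$")

def parse_path(value):             # Allow/Disallow: a (possibly empty) URL path
    return ("path", value)

def parse_agent(value):            # User-agent: arbitrary text up to a comment
    return ("agent", value.split("#", 1)[0].strip())

def parse_delay(value):            # hypothetical Crawl-delay extension
    return ("delay", float(value))

# Table mapping lower-cased field names to their value parsers.
VALUE_PARSERS = {
    "user-agent":  parse_agent,
    "allow":       parse_path,
    "disallow":    parse_path,
    "crawl-delay": parse_delay,
}

def parse_line(line):
    line = line.rstrip("\n")
    if not line.strip() or line.lstrip().startswith("#"):
        return None                                 # blank line or comment
    m = FIELD_RE.match(line)
    if not m:
        raise SyntaxError(f"malformed line: {line!r}")
    field, value = m.group(1).lower(), m.group(2)
    parser = VALUE_PARSERS.get(field)
    if parser is None:
        return (field, ("unknown", value))          # unrecognised extension
    return (field, parser(value))

print(parse_line("Disallow: /admin"))   # ('disallow', ('path', '/admin'))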

Upvotes: 1
