Reputation: 17040
Say I am writing a lexer for scanning robots.txt.
Say I want to start with the old RFC (http://www.robotstxt.org/norobots-rfc.txt) and only want to do the most basic syntax:
# COMMENT
User-agent: Google
Disallow: /admin
Disallow: /foo.html
Allow: /bar/bar.html
The very basic rules would look like this in Python:
_RULES = [
    ("WHITESPACE", r"\s+"),
    ("COLON", r":"),
    ("COMMENT", r"#.+"),
    ("UA_ID", r"user-agent"),
    ("ALLOW_ID", r"allow"),
    ("DISALLOW_ID", r"disallow"),
    ("RPATH", r"(/[a-zA-Z0-9:@&=.]*)+"),   # path fragments such as /foo.html
    ("UA_VALUE", r".+"),                   # anything else up to end of line
]
Note that if we put UA_VALUE before RPATH, its greediness will cause /admin to be matched as UA_VALUE, which is not what we really want later in the parsing stage.
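For illustration (not part of the original post), here is a minimal first-match-wins tokenizer over the _RULES above; the tokenize name, the case-insensitive matching, and the error handling are my own assumptions. Swapping RPATH and UA_VALUE changes the output exactly as described.

import re

# Minimal sketch: try each rule at the current position, first match wins.
def tokenize(line, rules):
    tokens = []
    pos = 0
    while pos < len(line):
        for kind, pattern in rules:
            m = re.compile(pattern, re.IGNORECASE).match(line, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
    return tokens

print(tokenize("Disallow: /admin", _RULES))
# [('DISALLOW_ID', 'Disallow'), ('COLON', ':'), ('WHITESPACE', ' '), ('RPATH', '/admin')]
# With UA_VALUE listed before RPATH, the last token becomes ('UA_VALUE', '/admin') instead.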
Now suppose I want my lexer to take in a set of "extension rules" (many robots.txt files now have "custom" directives, which the old RFC foresaw as "extensions"). An extension might be to support a full URL instead of a path fragment like RPATH, or to support Crawl-delay, which takes numbers or fractions. Now the user has to insert the new extension rules into _RULES in the right order to avoid matching against the wrong rule and producing the wrong kind of token.
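To make that concrete, here is what such an extension set might look like and where it would have to be spliced in; the token names, patterns, and the splice are hypothetical, not from the post.

# Hypothetical extension rules; their placement, not the exact patterns, is the issue.
_EXTENSION_RULES = [
    ("CRAWL_DELAY_ID", r"crawl-delay"),
    ("NUMBER", r"\d+(\.\d+)?"),          # e.g. 10 or 0.5
    ("FULL_URL", r"https?://[^\s#]+"),   # full URLs instead of RPATH fragments
]

# Both NUMBER and FULL_URL must sit before UA_VALUE (r".+"),
# otherwise UA_VALUE swallows their input first.
_RULES[-1:-1] = _EXTENSION_RULES   # splice in just before UA_VALUE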
I can see the problem with using regexes incompetently, but I believe this order-sensitivity issue can also exist in the hand-written, char-by-char, switch-statement based approach. In general, can we avoid this ordering problem? Or do I attempt to "fix" the wrong token type later, in semantic analysis, by looking at a token's neighbors? That might work for a trivial language like robots.txt. Do I need to break some of my rules down into smaller units? For example, instead of one regex for the entire http/https URL, I would make http:// and https:// two separate token types.
Upvotes: 0
Views: 53
Reputation: 241731
You could just include the : in the field-name tokens. Also, the "standard" insists that the declaration start in the first column, so you could also add an anchor operator (^). If that were not the case, you could dress the regex up a bit:
"^[[:blank:]]*Allow[[:blank:]]*:"
Beyond that, robots.txt, like many internet standards, does not yield very well to non-semantic lexical analysis. Individual directives have their own lexical rules. Unextended robots.txt is an easy case, since the only things allowed are (partial) URLs in the case of Allow and Disallow, and arbitrary strings not containing a # in the case of User-Agent. But extensions might have their own formats.
My usual approach to parsing field: value protocols is to extract the field (using a simple regex like the one above) and then look the syntax up in a table which maps field-names to value-parsers. There are a variety of variants on this theme, but that's the basic idea.
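A minimal sketch of that table-driven idea, with hypothetical parse_* helpers standing in for real value-parsers:

import re

FIELD_RE = re.compile(r"^([A-Za-z-]+)[ \t]*:[ \t]*(.*)$")

# Each value-parser knows the lexical rules of its own directive.
def parse_path(value):  return ("PATH", value)
def parse_agent(value): return ("AGENT", value.split("#", 1)[0].strip())
def parse_delay(value): return ("DELAY", float(value))

VALUE_PARSERS = {
    "allow":       parse_path,
    "disallow":    parse_path,
    "user-agent":  parse_agent,
    "crawl-delay": parse_delay,   # an extension just adds a table entry
}

def parse_line(line):
    line = line.strip()
    if not line or line.startswith("#"):
        return None                    # blank line or comment
    m = FIELD_RE.match(line)
    if not m:
        return None                    # or report an error, depending on strictness
    field, value = m.group(1).lower(), m.group(2)
    parser = VALUE_PARSERS.get(field)
    return (field, parser(value)) if parser else (field, ("UNKNOWN", value))

print(parse_line("Disallow: /admin"))    # ('disallow', ('PATH', '/admin'))
print(parse_line("Crawl-delay: 1.5"))    # ('crawl-delay', ('DELAY', 1.5))

The ordering problem disappears because each field name selects exactly one value-parser; there is no global rule list to keep sorted.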
Upvotes: 1