Roger Costello
Roger Costello

Reputation: 3209

Is my lexer doing too much -- is it doing the work of the parser?

My input consists of a series of names, each on a new line. Each name consists of a firstname, optional middle initial, and lastname. The name fields are separated by tabs. Here is a sample input:

Sally   M.    Smith
Tom     V.    Jones
John          Doe

Below are the rules for my Flex lexer. It works fine but I am concerned that my lexer is doing too much: it is determining that a token is a firstname or a middle initial or a lastname. Should that determination be done in the parser, not the lexer? Am I abusing the Flex state capability? What I am seeking is a critique of my lexer. I am just a beginner, how would a parsing expert create lexer rules for this input?

<INITIAL>{
         [a-zA-Z]+          { yylval.strval = strdup(yytext); return(FIRSTNAME); }
         \t                 { BEGIN MI_STATE; }
         .                  { BEGIN JUNK_STATE; }
}
<MI_STATE>{
        [A-Z]\.             { yylval.strval = strdup(yytext); return(MI); }
        \t                  { BEGIN LASTNAME_STATE; }
         .                  { BEGIN JUNK_STATE; }
}
<LASTNAME_STATE>{
         [a-zA-Z]+          { yylval.strval = strdup(yytext); return(LASTNAME); }
         \n                 { BEGIN INITIAL; return EOL; }
         .                  { BEGIN JUNK_STATE; }
}
<JUNK_STATE>.               { printf("JUNK: %s\n", yytext);  }

Upvotes: 0

Views: 194

Answers (2)

Kaz
Kaz

Reputation: 58500

Yes; your lexer is parsing. The main evidence is that it's implementing identical rules in different start states. Two rules have exactly the same pattern.

The purpose of start states in the context of lexing is to modify the lexical grammar in order to shield the parser from certain differences. It works with the parser. For instance, say you had some document language in which $ shifts into math expression mode, which has different tokenizing rules. The lexer still just returns tokens in math mode; it doesn't try to parse the math expressions. It is the parser which will determine that, if the brackets are balanced, then another $ can shift out of math mode.

In your code the rules for returning a last name and first name are identical; you have used to start state to handle phrase structure syntax: the fact that the last name comes later than the first name.

Another bit of telltale evidence that the lexer is parsing is that the lexer itself is handling all of the start condition changes. In our $...$ math mode example, we might have the lexer shift into a start state when it sees the $ symbol. However, if the lexer also recognizes the end of math mode, then that is evidence it is parsing the math mode expression. The end can only be recognized by following the nested phrase structure grammar of math mode expressions. The way you would handle that would be for the lexer to expose a function lex_end_math_mode(). When the parser processes and reduces the entire math mode expression, it calls this function to tell the lexer to switch back to the lexical syntax outside of math mode. The math-mode-terminating dollar sign would likely also appear as a token visible to the parser, though the leading one might not. So that is to say, the parser parses math_mode_expr : expr '$': a math mode expression followed by a required dollar sign to end math mode. The action for that rule would include the call to lex_end_math_mode, so the lexer returns to the tokenization rules outside of math mode for scanning the next token after the closing $.

There is no right or wrong answer because it's all parsing. Every grammar that is divided into tokens and phrase structure rules could be expressed by a unified grammar which includes the rules for the token structure.

Why we often use a design which separates lexical scanning and parsing is that:

  • Unifying the lexical and phrase structure grammar into one will turn LL(1) into LL(k). A recursive-descent parser then needs to look k symbols ahead to make parsing decisions. For instance if you're parsing C with this holistic approach, you need to treat int as a reserved keyword. That requires four symbols of lookahead: you have to recognize i, n, t, and then if the next symbol indicates that the token has ended, you treat that as the keyword, otherwise an identifier.

  • Performance: lexical scanning uses efficient techniques tailored to that task, which take advantage of the restriction that the lexical grammar is regular.

  • Correspondence to spec: if you have a language whose specification is described in terms of a lexical grammar separate from a phrase structure grammar, then if you implement it that way, features of your code are more readily traceable to features of requirement spec. You may be able to write unit tests which separately show that the lexing and parsing obeys the spec.

  • Schooling: people who went through a CS program that included a course on compiler construction had separate lexing and parsing drilled into their heads, and whenever it comes up in their subsequent career, they just lean on that wisdom. They are never confronted with situations in which they recognize it as not being a good approach, and don't question it.

Whatever works in your individual situations with whatever you're parsing overrules the theory. If it's convenient for you to recognize some phrase-like fragments in the lexer, and you're able to convince yourself that it's a clean approach, then by all means do it that way.

Upvotes: 0

David Gorsline
David Gorsline

Reputation: 5003

You can use lexer states as you do in this question. But it's better to use them as a means to conditionally activate rules. For examples, think of handling multi-line comments or here documents or (for us silverbacks) embedded SQL.

In your question, there's no lexical difference between a given name and a family name -- they both are matched by [a-zA-Z]+, as would be middle names, if you were to extend your lexer.

Short answer: yes, lex NAME tokens and let the parser determine whether you have three NAME tokens on a line.

Upvotes: 1

Related Questions