Nick
Nick

Reputation: 780

How to implement the lexer hack for a C parser in ANTLR

Is it possible to implement the classic Yacc lexer hack to differentiate between identifier names and type names in a C parser generated by ANTLR4, using a standard C grammar (like the one found on the official ANTLR4 GitHub repo) ?

It seems ad-hoc code that can be plugged in ANTLR4 lexers is quite limited. In the book "The Definitive ANTLR4 Reference", Terrence Parr says:

"A common practice that s been around forever involves sending feedback from the parser to the lexer so that the lexer can send precise tokens to the parser. [...] Unfortunately, this is not possible with ANTLR grammars because ANTLR-generated parsers often look very far ahead in the token stream to make parsing decisions. [...]"

Any way to circumvent the above, and implement that feedback loop? Or is the implementation of a C parser in ANTLR4 just impossible without resorting to crazy hacks when visiting the parse tree?

Upvotes: 5

Views: 1262

Answers (2)

Ira Baxter
Ira Baxter

Reputation: 95306

Who says you have to ask the parsing machinery anything to implement the lexer hack? The hack is in the lexer.

The key to the C lexer hack is to have the lexer, all by itself, record which identifiers are typenames. You can do this in the lexer by tracking the emitted lexical tokens. With enough ad hockery (which is why it is called a hack) your lexer can recognize "typedef .... x;" as it goes by, and record "x" as a type in a lexer-local symbol table.

Then when an identifier is encountered by the lexer, that happens to be typename according to its local symbol table, the lexer can emit "type_identifier" rather than "identifier".

Your lexer has to keep track of scoping constructs, too, in case the typedef is scope-local, and adjust the table of live typedefs accordingly as it scans across scope boundaries.

Of course, this hack works only if you know that you have a language like C, in which the parser will not backtrack back across the tokens because they might have a different parse. AFAIK, ANTLR doesn't backtrack.

(Yes, if the parser will provide access to a symbol table that it tracks as some kind of side effect of parsing, you can simply have the lexer consult that table, again provided the parser won't backtrack. What makes this "safe" is presumably that while executing the lexer, the parsing subystem is not executing in the parser, and so symbol table accesses are safe. If your lexer is producing a pipeline of lexemes in advance of the parser's use, then it is running asynchronously WRT to the parser and now it cannot use the parser's symbol table; you have to fall back on a lexer-local symbol table as described above).

(If you switch to a GLR parser, you can avoid the hack.)

Upvotes: 2

DeftlyHacked
DeftlyHacked

Reputation: 405

No. You literally quoted the answer to your own question.

On a side note, writing a parser isn't that hard. Forewarning, I wrote this recursive descent parser off the top of my head; it's not been tested, but it nonetheless will give you the basic idea:

#include <ctype.h>     // isalnum, isalpha, isdigit
#include <stdio.h>     // printf
#include <stdlib.h>    // atoi
#include <string.h>    // strncmp


/* ASCII numbers are positive, and only between 0 and 127.
 * We can thus use negative numbers to identify special tokens. */
enum {
    TOK_SPACE = -1,
    TOK_ID = -2,
    TOK_NUM = -3,
    TOK_ELSE = -4,
    TOK_IF = -5
};


int token_length;
union {
    char *s;
    int i;
} token_value;


int get_token(char *str) {
    token_length = 0;
    if (isspace(*str)) {
        do {
            token_length++;
            str++;
        } while (isspace(*str));
        return TOK_SPACE;
    }
    if (isalpha(*str) || *str == '_') {
        token_value.s = str;
        do {
            token_length++;
            str++;
        } while (isalnum(*str) || *str == '_');
        if (!strncmp(token_value.s, "else", token_length))
            return TOK_ELSE;
        if (!strncmp(token_value.s, "if", token_length))
            return TOK_IF;
        return TOK_ID;
    }
    if (isdigit(*str)) {
        token_value.s = str;
        do {
            token_length++;
            str++;
        } while (isdigit(*str));
        token_value.i = atoi(str);
        return TOK_NUM;
    }
    token_length = 1;
    return *str;
}

int is_conditional(char *str) {
    int tok, len = 0, rule_len;
    tok = get_token(str);
    if (tok == TOK_SPACE) {      // accept space
        len += token_length;
        str += token_length;
        tok = get_token(str);
    }
    if (tok != TOK_ID && tok != TOK_NUM) // require id or number
        return 0;
    return len + token_length;   // return the length of the entire match
}

int is_if_statement(char *str) {
    int tok, len = 0, rule_len;
    tok = get_token(str);
    if (tok == TOK_SPACE) {      // accept space
        len += token_length;
        str += token_length;
        tok = get_token(str);
    }
    if (tok == TOK_IF) {         // require 'if'
        len += token_length;
        str += token_length;
        tok = get_token(str);
    } else return 0;
    if (tok == TOK_SPACE) {      // accept space
        len += token_length;
        str += token_length;
        tok = get_token(str);
    }
    if (tok == '(') {            // require l-paren
        len++;
        str++;
        tok = get_token(str);
    } else return 0;
    rule_len = is_conditional(str);
    if (rule_len != 0) {         // require conditional
        len += rule_len;
        str += rule_len;
        tok = get_token(str);
    } else return 0;
    if (tok == TOK_SPACE) {      // accept space
        len += token_length;
        str += token_length;
        tok = get_token(str);
    }
    if (tok == ')') {            // require r-paren
        len++;
        str++;
        tok = get_token(str);
    } else return 0;
    if (tok == TOK_SPACE) {      // accept space
        len += token_length;
        str += token_length;
        tok = get_token(str);
    }
    if (tok != TOK_ID)
        return 0;
    return len + token_length;   // return the length of the entire match
}

void parse(char *str) {
    int rule_len;
    if ((rule_len = is_if_statement(str)))
        printf("an if-statement was matched!\n");
    else
        printf("the input is not valid code!\n");
    str += rule_len;
}

int main(int argc, char *argv[]) {
    parse("if (x) y");     // valid!
    parse("if (1) ...");   // invalid!
    return 0;
}

Upvotes: -3

Related Questions