Delimiting tokens containing wildcards and alternate endings with ANTLR

Question

I have to write a parser for a legacy programming language to translate it into another. SQL statements can be embedded directly in assignments.

Since I don't need to actually parse SQL, but just pass it as a string to a library function of the target environment, I wanted to recognize SQL statements as tokens at the lexer level using the following rule.

 SqlStatement : SELECT .+ ';' ;

Unfortunately, SQL statements can be terminated either by a semicolon or by the keyword EXECUTING (which introduces a block of commands, but that is not relevant here).

I cannot simply define another token as:

SqlAndExecute : SELECT .+ EXECUTING ;

The two rules overlap, and this causes ANTLR to (surprisingly?) emit a spurious "ELECT" token.

Even if that worked, I couldn't write something like

 SqlStatement : SELECT .+ ';' | EXECUTING;

because I need to differentiate between the two forms.

Can I get this result at all? I've tried to write syntactic predicates but I'm probably still missing something.

I'd prefer to avoid parsing SQL queries if possible.

NB: SELECT is defined as S E L E C T with fragment S: 's'|'S', and so on for the other letters in the identifier; similarly for EXECUTING

Bart Kiers · Accepted Answer

Don't use .+ ';' in this case: with that, you cannot make a distinction between a ';' as the end of an SQL statement and one inside a string literal.
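To make the problem concrete, here is a small plain-Java sketch (not ANTLR code; the class and method names are made up for illustration). It mimics a rule that stops at the first ';', which is how a non-greedy .+ ';' behaves, next to a scan that skips single-quoted literals. It assumes SQL string literals use single quotes with '' as the escaped quote.

```java
// Hypothetical helper, for illustration only: it is not part of the grammar.
class NaiveSqlCut {

    // Stops at the first ';' it sees, like a non-greedy ".+ ';'" would.
    static String cutAtFirstSemicolon(String src) {
        int end = src.indexOf(';');
        return end < 0 ? src : src.substring(0, end + 1);
    }

    // Stops at the first ';' that is outside a single-quoted literal.
    // An escaped quote ('') toggles the flag twice, so the scan stays inside the literal.
    static String cutOutsideLiterals(String src) {
        boolean inLiteral = false;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\'') {
                inLiteral = !inLiteral;
            } else if (c == ';' && !inLiteral) {
                return src.substring(0, i + 1);
            }
        }
        return src;
    }

    public static void main(String[] args) {
        String sql = "SELECT a FROM b WHERE c=';' and More;";
        System.out.println(cutAtFirstSemicolon(sql)); // SELECT a FROM b WHERE c=';
        System.out.println(cutOutsideLiterals(sql));  // SELECT a FROM b WHERE c=';' and More;
    }
}
```

The first call cuts the statement in the middle of the literal, which is exactly the token you don't want.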

To distinguish between a SqlAndExecute and a SqlStatement, match what both tokens have in common and then, at the end, change the type of the token like this:

Sql
 : SELECT SP SqlAtom+ ( ';'       {$type=SqlStatement;}
                      | EXECUTING {$type=SqlAndExecute;}
                      )
 ;

fragment SqlStatement  : /* empty, used only for the token-type */ ;
fragment SqlAndExecute : /* empty, used only for the token-type */ ;

Now, an SqlAtom is either a string literal or, when EXECUTING is not ahead, any character other than a single quote ('\'') or a semicolon (';'). The "when EXECUTING is not ahead" part must be handled by some manual look-ahead in the lexer and a semantic predicate.

A quick demo:

grammar T;  

@lexer::members {

  private boolean aheadIgnoreCase(String text) {
    int i;

    for(i = 0; i < text.length(); i++) {

      String charAhead = String.valueOf((char)input.LA(i + 1));

      if(!charAhead.equalsIgnoreCase(String.valueOf(text.charAt(i)))) {
        return false;
      }
    }

    // there can't be a letter after 'text', otherwise it would be part of an identifier
    return !Character.isLetter((char)input.LA(i + 1));
  }
}

parse
 : (t=. {System.out.printf("\%-15s'\%s'\n", tokenNames[$t.type], $t.text);})* EOF
 ;

Sql
 : SELECT SP SqlAtom+ ( ';'       {$type=SqlStatement;}
                      | EXECUTING {$type=SqlAndExecute;}
                      )
 ;

Space
 : SP+ {skip();}
 ;

Id
 : ('a'..'z' | 'A'..'Z')+
 ;

fragment SqlAtom
 : {!aheadIgnoreCase("executing")}?=> ~('\'' | ';')
 | Str
 ;

fragment Str : '\'' ('\'\'' | ~('\'' | '\r' | '\n'))* '\'';

fragment SELECT    : S E L E C T;
fragment EXECUTING : E X E C U T I N G;
fragment SP        : ' ' | '\t' | '\r' | '\n';

fragment C : 'c' | 'C';
fragment E : 'e' | 'E';
fragment G : 'g' | 'G';
fragment I : 'i' | 'I';
fragment L : 'l' | 'L';
fragment N : 'n' | 'N';
fragment S : 's' | 'S';
fragment T : 't' | 'T';
fragment U : 'u' | 'U';
fragment X : 'x' | 'X';

fragment SqlStatement  : ;
fragment SqlAndExecute : ;

And if you now parse the input:

Select bar from EXECUTINGIT EXECUTING
x
Select foo from EXECUTING
y
SELECT a FROM b WHERE c=';' and More;

the following will be printed to the console:

SqlAndExecute  'Select bar from EXECUTINGIT EXECUTING'
Id             'x'
SqlAndExecute  'Select foo from EXECUTING'
Id             'y'
SqlStatement   'SELECT a FROM b WHERE c=';' and More;'
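If it is not obvious why EXECUTINGIT stays inside the first token while the ';' inside the literal does not end the last one, the following plain-Java sketch walks through the same decisions outside of ANTLR. It is only an approximation of the Sql rule, not generated lexer code: SqlTokenSketch, scan and skipLiteral are names invented here, and the input position is assumed to already point at SELECT.

```java
// Hypothetical stand-alone approximation of the Sql lexer rule, for illustration only.
class SqlTokenSketch {

    enum Type { SqlStatement, SqlAndExecute }

    // Same idea as aheadIgnoreCase in the grammar, but on a String and an explicit position.
    static boolean aheadIgnoreCase(String src, int pos, String text) {
        if (!src.regionMatches(true, pos, text, 0, text.length())) {
            return false;
        }
        int after = pos + text.length();
        // a letter right after 'text' means it belongs to a longer word (EXECUTINGIT)
        return after >= src.length() || !Character.isLetter(src.charAt(after));
    }

    // Returns the index just past the literal that opens at 'open' ('' is an escaped quote).
    static int skipLiteral(String src, int open) {
        int i = open + 1;
        while (i < src.length()) {
            if (src.charAt(i) != '\'') {
                i++;
            } else if (i + 1 < src.length() && src.charAt(i + 1) == '\'') {
                i += 2;
            } else {
                return i + 1;
            }
        }
        return i; // unterminated literal: give up at end of input
    }

    // Scans one token starting at 'start' and returns "<type> '<text>'",
    // or null when neither ';' nor EXECUTING is found.
    static String scan(String src, int start) {
        int i = start + "select".length();
        while (i < src.length()) {
            if (aheadIgnoreCase(src, i, "executing")) {
                i += "executing".length();
                return Type.SqlAndExecute + " '" + src.substring(start, i) + "'";
            }
            char c = src.charAt(i);
            if (c == ';') {
                return Type.SqlStatement + " '" + src.substring(start, i + 1) + "'";
            }
            if (c == '\'') {
                i = skipLiteral(src, i); // the ';' inside the literal is never inspected
                continue;
            }
            i++;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(scan("Select bar from EXECUTINGIT EXECUTING", 0));
        System.out.println(scan("Select foo from EXECUTING", 0));
        System.out.println(scan("SELECT a FROM b WHERE c=';' and More;", 0));
    }
}
```

The type is decided only when the terminator is reached, which is the same trick as assigning $type in the last alternative of the Sql rule.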

EDIT

Note that the Sql rule now always produces an SqlStatement or SqlAndExecute token. In other words: there will never be a Sql token. If you want to match either a SqlStatement or SqlAndExecute, create a parser rule that matches one of them:

sql
 : SqlStatement
 | SqlAndExecute
 ;

and use sql in your parser rule(s) instead of Sql.
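Keep in mind that the text of both tokens still includes the terminator. Since the goal in the question is to hand the SQL to a library function as a string, you may want to strip it first. A minimal sketch, assuming you do this in plain Java on the token text (SqlTextSketch and stripTerminator are hypothetical names, not part of ANTLR):

```java
// Hypothetical helper, for illustration only.
class SqlTextSketch {

    private static final String KEYWORD = "executing";

    // Returns the SQL text without its trailing ';' or EXECUTING keyword.
    static String stripTerminator(String tokenText) {
        String t = tokenText;
        int keywordStart = t.length() - KEYWORD.length();
        if (t.endsWith(";")) {
            t = t.substring(0, t.length() - 1);
        } else if (t.regionMatches(true, keywordStart, KEYWORD, 0, KEYWORD.length())) {
            // regionMatches returns false for a negative offset, so short inputs are safe
            t = t.substring(0, keywordStart);
        }
        return t.trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTerminator("Select foo from EXECUTING"));
        System.out.println(stripTerminator("SELECT a FROM b WHERE c=';' and More;"));
    }
}
```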
