Reputation: 1284
I have to write a parser for a legacy programming language to translate it into another. SQL statements can be embedded directly in assignments.
Since I don't need to actually parse SQL, but just pass it as a string to a library function of the target environment, I wanted to recognize SQL statements as tokens at the lexer level using the following rule.
SqlStatement : SELECT .+ ';' ;
Unfortunately, SQL statements can be terminated either by a semicolon or by the keyword EXECUTING (which introduces a block of commands, but that is not relevant here).
I cannot simply define another token as:
SqlAndExecute : SELECT .+ EXECUTING ;
because the two rules overlap, which causes ANTLR to (surprisingly?) emit a spurious "ELECT" token.
Even if it worked, I can't even write something like
SqlStatement : SELECT .+ ';' | EXECUTING;
because I need to differentiate between the two forms.
Can I get this result at all? I've tried to write syntactic predicates but I'm probably still missing something.
I'd prefer to avoid parsing SQL queries if possible.
NB: SELECT is defined as S E L E C T, with fragment S: 's'|'S' and so on for the other letters in the identifier; similarly for EXECUTING.
Upvotes: 2
Views: 533
Reputation: 170227
Don't use .+ ';' in this case: with that, you cannot distinguish between a ';' that ends an SQL statement and one inside a string literal.
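To make the problem with .+ ';' concrete, here is a minimal sketch in plain Java (not ANTLR) that compares stopping at the first semicolon with a quote-aware scan. The class and method names are made up for illustration, and the scan is simplified: it only knows single-quoted literals with '' as an escaped quote.

```java
class TerminatorDemo {
    // Naive approach, equivalent in spirit to `.+ ';'`: stop at the first semicolon.
    static int naiveEnd(String sql) {
        return sql.indexOf(';');
    }

    // Quote-aware approach: semicolons inside '...' literals are ignored.
    static int quoteAwareEnd(String sql) {
        boolean inString = false;
        for (int i = 0; i < sql.length(); i++) {
            char c = sql.charAt(i);
            if (c == '\'') {
                if (inString && i + 1 < sql.length() && sql.charAt(i + 1) == '\'') {
                    i++; // doubled quote inside a literal: stay inside the literal
                } else {
                    inString = !inString; // literal starts or ends here
                }
            } else if (c == ';' && !inString) {
                return i; // first semicolon outside any literal
            }
        }
        return -1; // no terminating semicolon found
    }

    public static void main(String[] args) {
        String sql = "SELECT a FROM b WHERE c=';' and More;";
        System.out.println(naiveEnd(sql));      // prints 25 (inside the literal)
        System.out.println(quoteAwareEnd(sql)); // prints 36 (the real terminator)
    }
}
```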
To distinguish between a SqlAndExecute and a SqlStatement, you simply match what both tokens have in common and then, at the end, change the type of the token like this:
Sql
 : SELECT SP SqlAtom+ ( ';'       {$type=SqlStatement;}
                      | EXECUTING {$type=SqlAndExecute;}
                      )
 ;
fragment SqlStatement : /* empty, used only for the token-type */ ;
fragment SqlAndExecute : /* empty, used only for the token-type */ ;
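The idea behind this rule, scanning the part both forms share and only fixing the token type once the terminator is seen, can be sketched outside ANTLR in plain Java. Everything here (SqlTokenSketch, Token, scan, keywordAt) is a hypothetical name for illustration, and the sketch assumes that start already points at SELECT; it does not check for the space after SELECT as the grammar rule does.

```java
class SqlTokenSketch {
    enum Type { SQL_STATEMENT, SQL_AND_EXECUTE }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
    }

    // True if `kw` starts at `pos` (ignoring case) and no letter follows it.
    static boolean keywordAt(String in, int pos, String kw) {
        int end = pos + kw.length();
        return in.regionMatches(true, pos, kw, 0, kw.length())
            && (end >= in.length() || !Character.isLetter(in.charAt(end)));
    }

    // Scans one statement; the token type is decided only by how it ends.
    static Token scan(String in, int start) {
        String kw = "executing";
        int i = start + "select".length();
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '\'') {
                i++; // skip the opening quote
                while (i < in.length()) {
                    if (in.charAt(i) == '\'') {
                        if (i + 1 < in.length() && in.charAt(i + 1) == '\'') {
                            i += 2; // '' is an escaped quote
                            continue;
                        }
                        break; // closing quote found
                    }
                    i++;
                }
                i++; // skip the closing quote
            } else if (c == ';') {
                return new Token(Type.SQL_STATEMENT, in.substring(start, i + 1));
            } else if (keywordAt(in, i, kw)) {
                return new Token(Type.SQL_AND_EXECUTE, in.substring(start, i + kw.length()));
            } else {
                i++;
            }
        }
        return null; // unterminated statement
    }

    public static void main(String[] args) {
        Token t = scan("Select foo from EXECUTING\ny", 0);
        System.out.println(t.type + " '" + t.text + "'");
    }
}
```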
Now, an SqlAtom is either a string literal or, when there's no EXECUTING ahead, any character other than a single quote ('\'') or a semicolon (';'). The "when there's no EXECUTING ahead" part must be handled by some manual extra look-ahead in the lexer and a semantic predicate.
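The look-ahead check itself does not depend on ANTLR. As a rough sketch, here it is in plain Java, with a String and an explicit position (both assumptions for illustration) standing in for the lexer's character stream:

```java
class LookAheadDemo {
    // Returns true if `text` occurs at `pos` in `input` (ignoring case)
    // and is not immediately followed by a letter, i.e. it is a whole
    // keyword rather than the start of a longer identifier.
    static boolean aheadIgnoreCase(String input, int pos, String text) {
        if (!input.regionMatches(true, pos, text, 0, text.length())) {
            return false;
        }
        int next = pos + text.length();
        return next >= input.length() || !Character.isLetter(input.charAt(next));
    }

    public static void main(String[] args) {
        String s = "from EXECUTINGIT EXECUTING";
        System.out.println(aheadIgnoreCase(s, 5, "executing"));  // prints false
        System.out.println(aheadIgnoreCase(s, 17, "executing")); // prints true
    }
}
```

The end of the input is treated like a non-letter, so a keyword at the very end of the text still counts as being ahead.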
A quick demo:
grammar T;
@lexer::members {
  // Returns true if the upcoming characters spell 'text' (ignoring case)
  // and are not followed by a letter.
  private boolean aheadIgnoreCase(String text) {
    int i;
    for(i = 0; i < text.length(); i++) {
      String charAhead = String.valueOf((char)input.LA(i + 1));
      if(!charAhead.equalsIgnoreCase(String.valueOf(text.charAt(i)))) {
        return false;
      }
    }
    // there can't be a letter after 'text', otherwise it would be an identifier
    return !Character.isLetter((char)input.LA(i + 1));
  }
}
parse
: (t=. {System.out.printf("\%-15s'\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
Sql
: SELECT SP SqlAtom+ ( ';' {$type=SqlStatement;}
| EXECUTING {$type=SqlAndExecute;}
)
;
Space
: SP+ {skip();}
;
Id
: ('a'..'z' | 'A'..'Z')+
;
fragment SqlAtom
: {!aheadIgnoreCase("executing")}?=> ~('\'' | ';')
| Str
;
fragment Str : '\'' ('\'\'' | ~('\'' | '\r' | '\n'))* '\'';
fragment SELECT : S E L E C T;
fragment EXECUTING : E X E C U T I N G;
fragment SP : ' ' | '\t' | '\r' | '\n';
fragment C : 'c' | 'C';
fragment E : 'e' | 'E';
fragment G : 'g' | 'G';
fragment I : 'i' | 'I';
fragment L : 'l' | 'L';
fragment N : 'n' | 'N';
fragment S : 's' | 'S';
fragment T : 't' | 'T';
fragment U : 'u' | 'U';
fragment X : 'x' | 'X';
fragment SqlStatement : ;
fragment SqlAndExecute : ;
And if you now parse the input:
Select bar from EXECUTINGIT EXECUTING
x
Select foo from EXECUTING
y
SELECT a FROM b WHERE c=';' and More;
the following will be printed to the console:
SqlAndExecute  'Select bar from EXECUTINGIT EXECUTING'
Id             'x'
SqlAndExecute  'Select foo from EXECUTING'
Id             'y'
SqlStatement   'SELECT a FROM b WHERE c=';' and More;'
Note that the Sql rule now always produces an SqlStatement or SqlAndExecute token. In other words: there will never be a Sql token. If you want to match either a SqlStatement or SqlAndExecute, create a parser rule that matches one of them:
sql
: SqlStatement
| SqlAndExecute
;
and use sql in your parser rule(s) instead of Sql.
Upvotes: 2