Reputation: 1237
I'm parsing a language that has a statement 'code' followed by '{', followed by a bunch of code that I have no interest in parsing, followed by '}'. I'd ideally like to have a rule like:
skip_code: 'code' '{' ~['}']* '}'
...which would simply skip ahead to the closing curly brace. The problem is that the code being skipped could itself contain pairs of curly braces. So what I essentially need to do is run a counter, incrementing on each '{' and decrementing on each '}', and end the parse rule when the counter is back to 0.
What's the best way of doing this in ANTLR4? Should I skip off to a custom function when 'code' is detected and swallow up the tokens and run my counter, or is there some elegant way to express this in the grammar itself?
EDIT: Some sample code, as requested:
class foo;
  int m_bar;
  function foo_bar;
    print("hello world");
  endfunction
  code {
    // This is some C code
    void my_c_func() {
      printf("I have curly braces {} in a string!");
    }
  }
  function back_to_parsed_code;
  endfunction
endclass
Upvotes: 5
Views: 4843
Reputation: 4481
You can use lexer modes for this. Note that the CODE section needs two modes: with only one mode, the lexer cannot tell the '}' that closes the CODE section apart from a '}' that closes a nested block.
Lexer
lexer grammar Question_41355044Lexer;
CODE: 'code';
LCURLY: '{' -> pushMode(CODE_0);
WS: [ \t\r\n] -> skip;
mode CODE_0;
CODE_0_LCURLY: '{' -> type(OTHER), pushMode(CODE_N);
RCURLY: '}' -> popMode; // Close for LCURLY
CODE_0_OTHER: ~[{}]+ -> type(OTHER);
mode CODE_N;
CODE_N_LCURLY: '{' -> type(OTHER), pushMode(CODE_N);
CODE_N_RCURLY: '}' -> type(OTHER), popMode;
OTHER: ~[{}]+;
Parser
parser grammar Question_41355044Parser;
options { tokenVocab = Question_41355044Lexer; }
skip_code: 'code' LCURLY OTHER* RCURLY;
Input
code {
  // This is some C code
  void my_c_func() {
    printf("I have curly braces {} in a string!");
  }
}
Output tokens
CODE LCURLY({) OTHER( // Th...) OTHER({) OTHER( pr...)
OTHER({) OTHER(}) OTHER( in a st...) OTHER(}) OTHER() RCURLY(}) EOF
The ANTLR4 grammar for ANTLR itself uses the same approach: https://github.com/antlr/grammars-v4/tree/master/antlr4
However, it relies on runtime code (LexerAdaptor.py) instead of two-level modes.
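To make the two-mode point concrete, here is a minimal hand-written Java sketch (not ANTLR-generated code) that traces the mode stack the lexer above would maintain. The names ModeStackTrace and classifyBraces are hypothetical and introduced only for illustration. Only braces are classified; every other character is assumed to end up in an OTHER token or in skipped whitespace.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: simulates pushMode/popMode for the three modes above.
final class ModeStackTrace {

    enum Mode { DEFAULT, CODE_0, CODE_N }

    // Returns the token type each brace in the input would receive.
    static List<String> classifyBraces(String input) {
        Deque<Mode> saved = new ArrayDeque<>(); // modes saved by pushMode
        Mode mode = Mode.DEFAULT;
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            if (c == '{') {
                // Only the outermost '{' keeps the LCURLY type.
                types.add(mode == Mode.DEFAULT ? "LCURLY" : "OTHER");
                saved.push(mode);
                mode = (mode == Mode.DEFAULT) ? Mode.CODE_0 : Mode.CODE_N;
            } else if (c == '}') {
                if (mode == Mode.DEFAULT) {
                    // No rule in the default mode matches '}'.
                    types.add("ERROR");
                    continue;
                }
                // Only a '}' seen in CODE_0 closes the whole section.
                types.add(mode == Mode.CODE_0 ? "RCURLY" : "OTHER");
                mode = saved.pop();
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // prints [LCURLY, OTHER, OTHER, OTHER, OTHER, RCURLY]
        System.out.println(classifyBraces("code { a { b {} } }"));
    }
}
```

With a single code mode, every '}' would get the same token type, so the parser rule could not tell where the section ends. Like the lexer above, this sketch counts braces inside strings and comments too, so unbalanced braces there would break the matching.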
Upvotes: 0
Reputation: 170178
I'd handle these code blocks in the lexer. A quick demo:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        // Note the "}}}" in the comment and the "{}" in the string literal:
        // neither should end the code block.
        String source = "class foo;\n" +
                " int m_bar;\n" +
                " function foo_bar;\n" +
                " print(\"hello world\");\n" +
                " endfunction\n" +
                " code {\n" +
                " // This is some C code }}} \n" +
                " void my_c_func() {\n" +
                " printf(\"I have curly braces {} in a string!\");\n" +
                " }\n" +
                " }\n" +
                " function back_to_parsed_code;\n" +
                " endfunction\n" +
                "endclass";
        System.out.printf("Tokenizing:\n\n%s\n\n", source);
        // ANTLRInputStream is deprecated since ANTLR 4.7; there,
        // CharStreams.fromString(source) is the replacement.
        DemoLexer lexer = new DemoLexer(new ANTLRInputStream(source));
        for (Token t : lexer.getAllTokens()) {
            // Print the token type name and its text, with line breaks shown as \n.
            System.out.printf("%-20s '%s'\n",
                    DemoLexer.VOCABULARY.getSymbolicName(t.getType()),
                    t.getText().replaceAll("[\r\n]", "\\\\n")
            );
        }
    }
}
If you run the class above, the following will be printed:
Tokenizing:
class foo;
int m_bar;
function foo_bar;
print("hello world");
endfunction
code {
// This is some C code }}}
void my_c_func() {
printf("I have curly braces {} in a string!");
}
}
function back_to_parsed_code;
endfunction
endclass
ID 'class'
ID 'foo'
ANY ';'
ID 'int'
ID 'm_bar'
ANY ';'
ID 'function'
ID 'foo_bar'
ANY ';'
ID 'print'
ANY '('
STRING '"hello world"'
ANY ')'
ANY ';'
ID 'endfunction'
ID 'code'
BLOCK '{\n // This is some C code }}} \n void my_c_func() {\n printf("I have curly braces {} in a string!");\n }\n }'
ID 'function'
ID 'back_to_parsed_code'
ANY ';'
ID 'endfunction'
ID 'endclass'
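The DemoLexer grammar itself is not shown above, so the following is only a hand-written Java sketch of the matching that the BLOCK token in the output implies: count braces, but ignore those inside string literals and comments. It is not the grammar used in the demo, and BraceSkipper and skipBlock are hypothetical names introduced for illustration.

```java
// Hypothetical helper: finds the end of a brace-delimited block.
final class BraceSkipper {

    // Returns the index just past the '}' that closes the '{' at index open,
    // or -1 if the block is never closed. Braces inside "..." strings,
    // line comments and block comments are not counted.
    static int skipBlock(String src, int open) {
        if (open < 0 || open >= src.length() || src.charAt(open) != '{') {
            throw new IllegalArgumentException("expected '{' at index " + open);
        }
        int depth = 0;
        int i = open;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '"') {
                // Skip a string literal, honouring backslash escapes.
                i++;
                while (i < src.length() && src.charAt(i) != '"') {
                    if (src.charAt(i) == '\\') {
                        i++;
                    }
                    i++;
                }
                i++; // step past the closing quote
            } else if (src.startsWith("//", i)) {
                // Skip a line comment up to (not including) the newline.
                while (i < src.length() && src.charAt(i) != '\n') {
                    i++;
                }
            } else if (src.startsWith("/*", i)) {
                // Skip a block comment.
                int end = src.indexOf("*/", i + 2);
                if (end < 0) {
                    return -1;
                }
                i = end + 2;
            } else {
                if (c == '{') {
                    depth++;
                } else if (c == '}' && --depth == 0) {
                    return i + 1;
                }
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String src = "code {\n"
                + "  // This is some C code }}}\n"
                + "  void my_c_func() {\n"
                + "    printf(\"I have curly braces {} in a string!\");\n"
                + "  }\n"
                + "}\n"
                + "function back_to_parsed_code;";
        int end = skipBlock(src, src.indexOf('{'));
        // prints: function back_to_parsed_code;
        System.out.println(src.substring(end).trim());
    }
}
```

The sketch does not handle C character literals such as '{', so a real lexer rule for embedded C would need one more alternative for those.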
Upvotes: 1
Reputation: 53357
I'd use something like:
skip_code: CODE_SYM block;
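// Exclude both braces from the "any token" alternative so that a nested '{' can only start a nested block.
block: OPEN_CURLY (~(OPEN_CURLY | CLOSE_CURLY) | block)* CLOSE_CURLY;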
CODE_SYM: 'code';
OPEN_CURLY: '{';
CLOSE_CURLY: '}';
Upvotes: 5