Stan

Reputation: 1237

ANTLR4 parse rule to match open/close brackets

I'm parsing a language that has a statement 'code' followed by '{', followed by a bunch of code that I have no interest in parsing, followed by '}'. I'd ideally like to have a rule like:

skip_code: 'code' '{' ~['}']* '}'

..which would simply skip ahead to the closing curly brace. The problem is that the code being skipped could itself have pairs of curly braces. So, what I essentially need to do is run a counter and increment on each '{' and decrement on each '}', and end the parse rule when the counter is back to 0.

What's the best way of doing this in ANTLR4? Should I skip off to a custom function when 'code' is detected and swallow up the tokens and run my counter, or is there some elegant way to express this in the grammar itself?

EDIT: Some sample code, as requested:

class foo;
  int m_bar;
  function foo_bar;
     print("hello world");
  endfunction
  code {
     // This is some C code
     void my_c_func() {
        printf("I have curly braces {} in a string!");
     }
  }
  function back_to_parsed_code;
  endfunction
endclass

Upvotes: 5

Views: 4843

Answers (3)

Ivan Kochurkin

Reputation: 4481

You can use lexer modes for this. Note that the CODE section needs two modes: with only one mode, the lexer cannot tell a nested '}' from the '}' that closes the section, so the section cannot be closed properly.

Lexer

lexer grammar Question_41355044Lexer;

CODE: 'code';
LCURLY: '{' -> pushMode(CODE_0);
WS:    [ \t\r\n] -> skip;

mode CODE_0;

CODE_0_LCURLY: '{' -> type(OTHER), pushMode(CODE_N);
RCURLY: '}' -> popMode;     // Close for LCURLY
CODE_0_OTHER: ~[{}]+ -> type(OTHER);

mode CODE_N;

CODE_N_LCURLY: '{' -> type(OTHER), pushMode(CODE_N);
CODE_N_RCURLY: '}' -> type(OTHER), popMode;
OTHER: ~[{}]+;

Parser

parser grammar Question_41355044Parser;

options { tokenVocab = Question_41355044Lexer; }

skip_code: 'code' LCURLY OTHER* RCURLY;

Input

code {
   // This is some C code
   void my_c_func() {
      printf("I have curly braces {} in a string!");
   }
}

Output tokens

CODE LCURLY({) OTHER(   // Th...) OTHER({) OTHER(      pr...) 
OTHER({) OTHER(}) OTHER( in a st...) OTHER(}) OTHER() RCURLY(}) EOF

The same approach is used by the ANTLR4 grammar for ANTLR itself: https://github.com/antlr/grammars-v4/tree/master/antlr4

There, however, runtime helper code (LexerAdaptor.py) is used instead of two-level modes.
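To make the push/pop behaviour easier to follow, here is a small hand-written Java simulation of the three modes (default, CODE_0, CODE_N). It is only a sketch: the class name ModeStackSketch and the tokenize method are made up for illustration, it returns token type names only, and it is not the code ANTLR generates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hand-written simulation of the DEFAULT / CODE_0 / CODE_N modes above.
// Illustration only; this is not code generated by ANTLR.
class ModeStackSketch {

    enum Mode { DEFAULT, CODE_0, CODE_N }

    // Returns token type names in the order the lexer above would emit them.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            Mode mode = modes.peek();
            if (mode == Mode.DEFAULT) {
                if (input.startsWith("code", i)) {
                    tokens.add("CODE");
                    i += 4;
                } else if (c == '{') {
                    tokens.add("LCURLY");
                    modes.push(Mode.CODE_0);      // pushMode(CODE_0)
                    i++;
                } else if (" \t\r\n".indexOf(c) >= 0) {
                    i++;                          // WS -> skip
                } else {
                    // The real lexer would report a token recognition error here.
                    throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
                }
            } else if (c == '{') {
                tokens.add("OTHER");              // a nested '{' is just text
                modes.push(Mode.CODE_N);          // pushMode(CODE_N)
                i++;
            } else if (c == '}') {
                // Only the '}' seen in CODE_0 closes the whole section.
                tokens.add(mode == Mode.CODE_0 ? "RCURLY" : "OTHER");
                modes.pop();                      // popMode
                i++;
            } else {
                // ~[{}]+ : consume a run of non-brace characters as one token.
                while (i < input.length() && input.charAt(i) != '{' && input.charAt(i) != '}') {
                    i++;
                }
                tokens.add("OTHER");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("code { a { b } c }"));
    }
}
```

The depth of the mode stack plays the role of the brace counter from the question: CODE_0 sits at depth one, so a '}' seen there must be the closing one.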

Upvotes: 0

Bart Kiers

Reputation: 170178

I'd handle these code blocks in the lexer. A quick demo:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {

    public static void main(String[] args) {

        String source = "class foo;\n" +
                "  int m_bar;\n" +
                "  function foo_bar;\n" +
                "     print(\"hello world\");\n" +
                "  endfunction\n" +
                "  code {\n" +
                "     // This is some C code }}} \n" +
                "     void my_c_func() {\n" +
                "        printf(\"I have curly braces {} in a string!\");\n" +
                "     }\n" +
                "  }\n" +
                "  function back_to_parsed_code;\n" +
                "  endfunction\n" +
                "endclass";

        System.out.printf("Tokenizing:\n\n%s\n\n", source);

        DemoLexer lexer = new DemoLexer(new ANTLRInputStream(source));

        for (Token t : lexer.getAllTokens()){
            System.out.printf("%-20s '%s'\n",
                    DemoLexer.VOCABULARY.getSymbolicName(t.getType()),
                    t.getText().replaceAll("[\r\n]", "\\\\n")
            );
        }
    }
}

If you run the class above, the following will be printed:

Tokenizing:

class foo;
  int m_bar;
  function foo_bar;
     print("hello world");
  endfunction
  code {
     // This is some C code }}} 
     void my_c_func() {
        printf("I have curly braces {} in a string!");
     }
  }
  function back_to_parsed_code;
  endfunction
endclass

ID                   'class'
ID                   'foo'
ANY                  ';'
ID                   'int'
ID                   'm_bar'
ANY                  ';'
ID                   'function'
ID                   'foo_bar'
ANY                  ';'
ID                   'print'
ANY                  '('
STRING               '"hello world"'
ANY                  ')'
ANY                  ';'
ID                   'endfunction'
ID                   'code'
BLOCK                '{\n     // This is some C code }}} \n     void my_c_func() {\n        printf("I have curly braces {} in a string!");\n     }\n  }'
ID                   'function'
ID                   'back_to_parsed_code'
ANY                  ';'
ID                   'endfunction'
ID                   'endclass'
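
The DemoLexer grammar itself is not shown above. As a rough idea of what a lexer-level block rule has to take care of, here is a hand-written Java sketch that finds the end of a block while ignoring braces inside string literals and // line comments. The class BlockScanSketch and the method endOfBlock are hypothetical names, and this is not the grammar that produced the output above; it also assumes C-style "..." strings with backslash escapes and does not handle /* ... */ comments.

```java
// Hypothetical helper, not the DemoLexer grammar behind the output above.
class BlockScanSketch {

    // Returns the index just past the '}' that closes the '{' at index `open`.
    // Braces inside "..." strings and // line comments are not counted.
    static int endOfBlock(String src, int open) {
        if (src.charAt(open) != '{') {
            throw new IllegalArgumentException("expected '{' at index " + open);
        }
        int depth = 0;
        int i = open;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '"') {
                i++;                                        // opening quote
                while (i < src.length() && src.charAt(i) != '"') {
                    if (src.charAt(i) == '\\') i++;         // skip the escaped character
                    i++;
                }
                i++;                                        // closing quote
            } else if (src.startsWith("//", i)) {
                while (i < src.length() && src.charAt(i) != '\n') i++;
            } else {
                if (c == '{') depth++;
                if (c == '}' && --depth == 0) return i + 1; // matching close found
                i++;
            }
        }
        throw new IllegalArgumentException("unbalanced braces");
    }

    public static void main(String[] args) {
        String src = "code {\n  // stray }}} here\n  void f() { printf(\"{\"); }\n}\nfunction g;";
        int open = src.indexOf('{');
        System.out.println(src.substring(open, endOfBlock(src, open)));
    }
}
```

This is the counter from the question, with two extra branches so that the `}}}` in the comment and the `{}` in the string do not change the depth.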

Upvotes: 1

Mike Lischke

Reputation: 53357

I'd use something like:

skip_code: CODE_SYM block;
block: OPEN_CURLY (~(OPEN_CURLY | CLOSE_CURLY) | block)* CLOSE_CURLY;

CODE_SYM: 'code';
OPEN_CURLY: '{';
CLOSE_CURLY: '}';
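
To show what the recursion in the block rule does, here is a hand-written Java equivalent operating on a list of token texts. It is only a sketch: BlockRuleSketch and its block method are illustrative names, the token list is assumed to come from some lexer, and ANTLR generates its own (different) parser code.

```java
import java.util.List;

// Hand-written equivalent of the recursive `block` rule; illustration only.
class BlockRuleSketch {

    // Parses one block starting at tokens.get(pos), which must be "{", and
    // returns the index just past its matching "}".
    static int block(List<String> tokens, int pos) {
        if (pos >= tokens.size() || !tokens.get(pos).equals("{")) {
            throw new IllegalArgumentException("expected '{' at token " + pos);
        }
        pos++;                                  // OPEN_CURLY
        while (pos < tokens.size() && !tokens.get(pos).equals("}")) {
            if (tokens.get(pos).equals("{")) {
                pos = block(tokens, pos);       // nested block: recurse
            } else {
                pos++;                          // any other token: skip it
            }
        }
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("missing '}'");
        }
        return pos + 1;                         // CLOSE_CURLY
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("code", "{", "a", "{", "b", "}", "c", "}", "function");
        System.out.println(block(tokens, 1));   // prints 8
    }
}
```

The call stack replaces the explicit counter here. Because this works on tokens, braces inside strings or comments only stay out of the way if the lexer turns those into their own tokens first.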

Upvotes: 5
