James Ko
James Ko

Reputation: 34609

ANTLR - How to determine what kind of parse tree "best fits" some code

I'm building a program with ANTLR where I ask the user to enter some Java code, and it spits out equivalent C# code. In my program, I ask the user to enter some Java code and then parse it. Up until now I've been assuming that they will enter something that will parse as a valid compilation unit on its own, e.g. something like

package foo;

class A { ... }
class B { ... }
class C { ... }

However, that isn't always the case. They might just enter code from the inside of a class:

public void method1() {
    ...
}

public void method2() {
    ...
}

Or the inside of a method:

System.out.print("hello ");
System.out.println("world!");

Or even just an expression:

context.getSystemService(Context.ACTIVITY_SERVICE)

If I try to parse such snippets by calling parser.compilationUnit(), it won't work correctly because most of the code is parsed as error nodes. I need to call the correct method depending on the nature of the code, such as parser.expression() or parser.blockStatements(). However, I don't want to ask the user to explicitly indicate this. What's the best way to infer what kind of code I'm parsing?

Upvotes: 4

Views: 343

Answers (2)

Ivan Kochurkin
Ivan Kochurkin

Reputation: 4501

Our algorithm in Swiftify tries to select the best suitable parse rule from the defined rule set. This web-service converts Objective-C code fragments to Swift and you can estimate the quality of conversion immediately by your own.

Algorithm

We use open-sourced ObjectiveC grammar. Detail Steps of algorithm look like this:

  1. Parse input Objective-C code fragment with the following rules
    • translationUnit
    • implementationDefinitionList
    • interfaceDeclarationList
    • expression
    • compoundStatement
  2. If parse result of the certain rule does not contain any error returns this rule at once.
  3. Select the rule with the nearest to the end parse error.
  4. If there are two or more rules with the same nearest to the end error location, select the rule with the minimum number of syntax errors.

Demo

There are test code samples that parsed with different parse rules:

Our algorithm is able to detect suitably parse rule even with an incorrect input:

Upvotes: 1

GRosenberg
GRosenberg

Reputation: 6001

Rather than trying to guess a valid grammar rule entry point to parse a language snippet of unknown scope, progressively add scope wrappers to the source text until a valid top-level rule parse is achieved.

That is, with each successive parse failure, progressively add dummy package, class, & method statements as source text wrappers.

Whichever wrapper was added to achieve a successful parse will then be a known quantity. Therefore, the parse tree node representing the original source text can be easily identified.

Probably want to use a fail-fast parser; construct the parser with the BailErrorStrategy to obtain this behavior.

Upvotes: 1

Related Questions