user1972748
user1972748

Reputation: 69

Recursive Descent Parser in Java

I would like to preface this by saying this is a homework assignment for my third Year Programming Languages Class, and I'm looking for some help with it. My assignment reads:

Deadline: February 22, 2013 at 11:55pm
Submission: Please upload the following to the CMS.

1. Source code
2. A screen shot of the execution of your program including the input file you used

Use any programming language you prefer to write a recursive-descent parser that parses the language generated by the following EBNF descriptions. Your parser should detect whether or not the input program has any syntax errors. It does not have to specify what and where the error is.

<program>   begin <stmt_list> end
<stmt_list>  <stmt> {;<stmt_list>}
<stmt>    <assign_stmt> | <while_stmt>
<assign_stmt>  <var> = <expr>
<var>  identifier  (An identifier is a string that begins with a letter followed by 0     or more letters and digits)
<expr>  <var> { (+|-) <var>}           
<while_stmt>   while (<logic_expr>)  <stmt>
<logic_expr> ® <var> (< | >) <var>  (Assume that logic expressions have only less than     or greater than operators)

The symbols that look funny are just arrows pointing to the right.

My problem at the moment is more logical then it is programming: in my first attempt, I read the entire input program in, saved it to a string, then parsed that string and converted every symbol to either a terminal, expr, or what have you.

I eventually found that this way would not work because, A: I don't think that it is RDP, B: many of the non terminals are made of more then 1 statement.

I gave up on that approach, and decided that before I waste more time programming, I would Pseudo everything out. My new idea was to make 1 method, for each non terminal symbol, and just parse the input string symbol by symbol, hoping between those methods. This approach seemed logical, but as I started writing the pseudocode I got very lost and confused as to what I needed to do. How would I finish this code?

Here is some pseudocode for RDP:

intputString;

public void parseProgram (Symbol.typeIsProgram) {
    if getNextSymbol == "begin" {
        if (intputString.substring (inputString.length()-3,
                inputString.length()) == "end") {
            Symbol stmt_lsit = new Symbol (intputString)
            parseStmt_list(stmt_list);              
        } else {
            Out "error, prog must end with end"
        }
    } else {
        Out "error, prog must begin with begin"
    }   
}

public void parseStmt_list (Stmbol.typeIsStmt_list) {
    symbol = getNextSymbol;
    if (Symbol.typeIsVar) {
        parseVar(symbol)
    } else if (Symbol.typeIsWhile)  {
        // weve only capture if the first word return is a while, we dont have the whole while statement yet
        ParseWhile_stmt(symbol)
    } else { }
}

public void parseStmt () { }
public void parseAssign_stmt () { }
public void parseVar () { }
public void parseExpr () { }
public void parseWhile_stmt () { }
public void parseLogic_expr () { }

public Symbol getNextSymbol() {
    //returns the next symbol in input string and removes it from the input string
}

Just an FYI a sample input program for my parser would be.

begin 
total = var1 + var2; 
while (var1 < var2) 
while ( var3 > var4)
var2 = var2 - var1 
end

Upvotes: 5

Views: 21270

Answers (3)

Dmitry Tikhonov
Dmitry Tikhonov

Reputation: 301

According to this particular assignment it is alright to use string processing in a functional way.

Try this algorithm:

  1. check, that line begins with begin and ends with end
  2. if yes - split remaining string with ; and do the following steps for each part (statement)
  3. check if statement consists a = or while substring
  4. for assignment check you can split an input with + or - and for each part check the variable condition (alphanumeric content)
  5. for while - check for ( and ) and recursively process two strings between brackets and after it
  6. finally, for logical expression split by substring < or >, and check that parts are variables (alphanumeric)

Maybe, it is not very flexible solution, but as I see it is acceptable for your task.

EDIT:

I found the task interesting and tried to write a complete solution.

public enum Parser {
PROGRAM {
    void parse(String s) throws ParseException {
        s = s.trim();
        if (s.startsWith("begin") && s.endsWith("end")) {
            STATEMENT_LIST.parse(s.substring(5, s.length() - 3));
        } else {
            throw new ParseException("Illegal begin/end");
        }
    }
},

STATEMENT_LIST {
    void parse(String s) throws ParseException {
        String[] parts = s.trim().split(";");
        for (String part : parts) {
            STATEMENT.parse(part.trim());
        }
    }
},

STATEMENT {
    void parse(String s) throws ParseException {
        if (s.startsWith("while")) {
            WHILE.parse(s.substring(5));
        } else if (s.contains("=")) {
            ASSIGNMENT.parse(s);
        } else {
            throw new ParseException("Illegal statement: " + s);
        }
    }
},

WHILE {
    void parse(String s) throws ParseException {
        int i = s.indexOf("(");
        int j = s.indexOf(")");
        if (i != -1 && j != -1) {
            LOGICAL.parse(s.substring(i + 1, j).trim());
            STATEMENT.parse(s.substring(j + 1).trim());
        } else {
            throw new ParseException("Illegal while: " + s);
        }
    }
},

ASSIGNMENT {
    void parse(String s) throws ParseException {
        int i = s.indexOf("=");
        if (i != -1) {
            VARIABLE.parse(s.substring(0, i).trim());
            EXPRESSION.parse(s.substring(i + 1).trim());
        }
    }
},

EXPRESSION {
    void parse(String s) throws ParseException {
        String[] parts = s.split("\\+|-");
        for (String part : parts) {
            VARIABLE.parse(part.trim());
        }
    }
},

LOGICAL {
    void parse(String s) throws ParseException {
        int i;
        if (s.contains("<")) {
            i = s.indexOf("<");
        } else if (s.contains(">")) {
            i = s.indexOf(">");
        } else {
            throw new ParseException("Illegal logical: " + s);
        }

        VARIABLE.parse(s.substring(0, i).trim());
        VARIABLE.parse(s.substring(i + 1).trim());
    }
},

VARIABLE {
    void parse(String s) throws ParseException {
        if (!s.matches("^[a-zA-Z][a-zA-Z0-9]*$")) {
            throw new ParseException("Illegal variable: " + s);
        }
    }
};

abstract void parse(String s) throws ParseException;

public static void main(String[] args) {
    try {
        PROGRAM.parse("begin \n" +
                "total = var1 + var2; \n" +
                "while (var1 < var2) \n" +
                "while ( var3 > var4)\n" +
                "var2 = var2 - var1 \n" +
                "end");
        System.out.println("OK");
    } catch (ParseException e) {
        System.out.println(e.getMessage());
    }
}
}

class ParseException extends Exception {
public ParseException(String message) {
    super(message);
}
}

Upvotes: 9

user1401452
user1401452

Reputation:

1) Tokenization

First break the input into tokens. In this case, every identifier and operator and literal. Make a big list of all the input tokens. Once you have tokens then you can start parsing. Make the tokens a linked list, so you can just say Token.Next to read the next token or Token.Next.Next to read 2 tokens ahead. Put a bunch of EOF tokens at the end so you will never skip past it.

2) Parsing

Parsing is like what you have already. So instead of thinking Symbol think Token. The Parse Statements list is a while loop that keeps parsing statements until the end.

For Parse Statement

public void parseStmt ()
{
  if (Token.kind == KEYWORD && Token.id == kw_while) {
    return ParseWhileStatement();
  }
  else {
    return ParseAssignStatement();
  }
}

Parsing the while statement will loop back to parse statements and thus it will "recursively descend" back into Parse statement, yielding nested while loops etc...

Parsing the assignment statement is very similar. PArse the Left side and then the right side. You need a bunch of functions for that....

The Node here is an Ast node. Abstract Syntax Tree.

Stuff like...

class Node {

}
class OpNode {
  OpNode Lhs;
  OpNode Rhs;
}
class MultiplyNode : OpNode {
  MultiplyNode(byref Token tok, OpNode Left, OpNode right) {
    tok = tok.Next;
    Lhs = left;
    Rhs = right;
  }
}




Node ParseSimpleExp() {
  if (Token.kind == Identifier) {
    Node n = new IdentNode;
    NextToken();
    return n;
  }
  if (Token.kind == Number) {
    Node n = new NumberNode;
    NextToken();
    return n;
  }
}


// In these examples move the token to next token when you create the new nodes
Node ParseMulExp() {
  Node n = ParseSimpleExp();
  while (1) {
    if (Token.Kind == Multiply) {
      n = new MultiplyNode(Token,n,ParseSimpleExp());
      continue;
    }
    if (Token.Kind == Divide) {
      n = new DivideNode(Token,n,ParseSimpleExp());
      continue;
    }
    return n;
 }
}

Node ParseAddExp() {
  Node n = ParseMulExp();
  while (1) {
    if (Token.Kind == Add) {
      n = new AddNode(Token,n,ParseMulExp());
      continue;
    }
    if (Token.Kind == Subtract) {
      n = new SubtractNode(Token,n,ParseMulExp());
      continue;
    }
    return n;
  }
}


Node ParseAssignStatement() {
  Node n = ParseAddExp();
  if (Token.kind == ASSIGN) {
    return new AssignStatement(Token,n,ParseAddExp());
  }
}

If you follow the logic you can see how the precedence rules are followed by recursively arriving at each target. Parsing expressions and starting from the assign is not a loop. It is like shown here. Obviously this is to simple, but it shows the technique.

So the RDP is caused by looking at the current token and then jumping to some function to handle token. Naturally this can come back to the same function hence is recursive. If you look at the ParseSimpleExp function then you can see that is a good place to maybe handle a parenthesized expression. A parens expression will cause recursion back to simple exp and possibly all the others like mul and add.

Upvotes: 4

Wolfgang Fahl
Wolfgang Fahl

Reputation: 15769

The structure of your parser code should resemble the structure of the language syntax. E.g.

 <program>  ::= begin <stmt_list> end

would translate into something like

function parse_program() {
  parse_begin();
  repeat parse_stmt();
  parse_end();
}

You might want not to confuse the token handling (scanner) with the parsing of the structure (parser).

I'd go with exception handling instead of if / else structures for error handling. You might want to keep track of where you are in the source (scanner) part for displaying a proper error message. Just ask the scanner for its state.

The assignment fortunately seems to need no conflict resolution so your recursive decent should work well. The interesting part is where in the parsing of

<while_stmt> ::= while (<logic_expr>)  <stmt>

you'll end up to call parse_stmt() recursively. This is what the whole idea of recursive descent parsing is about.

Upvotes: 1

Related Questions