Reputation: 4538
I am working on a simple toy language for describing UI widgets. I have used bison/flex to implement the grammar. I would now like to create an AST from the grammar rules. However, I am not sure of the "level of granularity" of the AST and what should be included in it. From what I have read, ASTs should be "minimal" and avoid providing redundant information. At the same time obviously I should not loose any information from the original source. Are there specific rules to follow when building ASTs?
The grammar is given below:
%{
#include <string>
#include <iostream>
void yyerror (const char *error);
int yylex();
extern int yylineno;
%}
%code{
// int lineno;
}
%union {
char* s;
double d;
int i;
}
/* declare tokens */
%token APPLICATION GEOMETRY
%token FORM BUTTON LABEL TEXTINPUT
%token ID NAME
%token WIDTH HEIGHT TEXT ONCLICK
%token ABSOLUTELAYOUT LINEARLAYOUT GRIDLAYOUT
%token ABSOLUTE_LAYOUT_DOT_LEFT ABSOLUTE_LAYOUT_DOT_TOP
%token EOL ENDOFFILE
%token<s> IDENTIFIER STRING
%token<d> INTEGER
%%
appDef: APPLICATION '{' NAME ':' STRING formdefList '}' {std::cout << "found app" << std::endl;}
iddef : ID ':' IDENTIFIER
formdefList : /* nothing */
| formdefList formdef
;
formdef : FORM '{' formbodydef '}'
;
formbodydef : /*nothing*/
| iddef
| formbodydef layoutdef
| error {
std::cout << "found error in form body near line: " << yylineno << std::endl;
std::exit(1);
}
;
layoutdef : ABSOLUTELAYOUT '{' layoutBody '}'
| GRIDLAYOUT '{' layoutBody '}'
| LINEARLAYOUT '{' layoutBody '}'
;
layoutBody : /* nothing */
| layoutBody layoutdef /* Layouts can be embedded in one another */
| layoutBody buttonDef
| layoutBody labelDef
| error { std::cout << "Was expecting a child control near line: " << yylineno << std::endl; std::exit(1);}
;
geometrydef : GEOMETRY ':' '{' geometrybody '}' { std::cout << "start of geometry def:" << std::endl;}
;
geometrybody : /*nothing*/ { std::cout << "start of geometrybody def" << std::endl;}
| geometrybody WIDTH ':' INTEGER
| geometrybody HEIGHT ':' INTEGER
| geometrybody ABSOLUTE_LAYOUT_DOT_LEFT ':' INTEGER
| geometrybody ABSOLUTE_LAYOUT_DOT_TOP ':' INTEGER
| error { std::cout << "error near line: " << yylineno << std::endl; std::exit(1);}
;
buttonDef : BUTTON '{' buttonBody '}'
;
buttonBody : /*nothing*/
| buttonBody iddef
| buttonBody TEXT ':' STRING
| buttonBody geometrydef
;
labelDef : LABEL '{' labelBody '}'
;
labelBody : /*nothing*/
| labelBody iddef
| labelBody TEXT ':' STRING
| labelBody geometrydef
;
%%
void yyerror (const char *error)
{
std::cout << error << std::endl;
}
Here is some sample input for validating the grammar:
Application{
name: "HelloWorld"
Form{
id: MainForm
LinearLayout{
Button{
geometry: {width:20 height:20}
}
AbsoluteLayout{
Button{
geometry:{
width:20
height:20
AbsoluteLayout.Top:10
AbsoluteLayout.Left:20
}
}
}
}
}
Form{
id: Secondary
}
}
I am now in the process of building an AST for the grammar and need some advice on its structure:
formdef
, formbodydef
, formdefList
explicitly appear in the ASTs? ID
, NAME
, etc...) and token value? Thanks for your advice!
Upvotes: 3
Views: 238
Reputation: 95306
Nodes need to contain an "abstract" syntax category (something typically close to the LHS of many rules). See https://stackoverflow.com/a/1916687/120163 for a discussion of AST vs CSTs, and why you might want to have nodes contain a concrete syntax category (e.g., exactly the LHS of rules, or the concrete names of leaves). The link discusses when you should keep terminals in your AST.
They should also contain values if they represent something that varies in a way that grammar does not record (e.g, identifier names, numeric and string literal constants, etc.) See https://stackoverflow.com/a/6320259/120163 for why you should convert values when you lex, not later.
Shown here: https://stackoverflow.com/a/17393852/120163 are examples of what ASTs might look like if you print them out (you better build an AST printer to help you debug them).
How much information you need to keep in node depends on what you intend to do with it eventually. If you want to regenerate source from the AST, you need to keep a lot of stuff you might not ever have considered, e.g. radix of converted numbers. See https://stackoverflow.com/a/5834775/120163 for more details.
Upvotes: 3