Reputation: 3542
I need to implement a simple parser which would parse the following string: a[b[c,d],e],f[g,h[i], j[k,l]],...
In the example above there is a GROUP of one or more OBJECTS. Each object has nested OBJECTS(b in a AND h,j in f) plus nested VALUES(c,d,e,g,i,k,l).
So, it's something like:
GROUP : OBJECT (,OBJECT)*
OBJECT: NAME '[' OBJECTorVALUE (OBJECTorVALUE)* ']'
OBJECTorVALUE: OBJECT | VALUE
VALUE : NAME
What will be the easiest way to parse such grammar manually?
I tried to parse it with recursive descent parser and it's doable but looks weird because you have to resolve the production for OBJECTorVALUE:
OBJECTorVALUE: NAME '[' blabla ']'
OBJECTorVALUE: NAME
So, to make it LL grammar (to be parsed by rec-descent) we should resolve it to
OBJECTorVALUE: NAME Z
Z: eps | '[' blabla ']'
and now rec-desc parser gets some unnecessary complexity and feels unnatural, so I have to introduce one of rec-desc methods to look-ahead for Z after the name. So, instead of easy-peasy methods
parseGroup -> parseObjects
parseObj -> parseName consume[ parseObjectValueGroup consume ]
parseObjectValueGroup -> parseObjectValues
parseObjectValue -> look('[') -> parseObj OR parseValue
I get the same but the last statement becomes
parseObjectValue -> parseName parseZ
parseZ -> look('[') -> parseObjWithoutName OR return empty
parseObjWithoutName -> ...
That looks messy to me, so can we use anything more simple here? I mean it's damn simple grammar which can be parsed even by string splits and looks as easy as it can be. Maybe Simple LR parser (SLR) will be more readable in the case?
Upvotes: 2
Views: 266
Reputation: 1579
This is really a very simple grammar. Your grammar is incomplete, because NAME
is not defined. The grammar defined in ABNF (RFC 5234) syntax is like this (assuming the NAME rule):
group = element *("," WS element)
element = name WS ["[" WS group "]" WS]
name = 1*(%x61-7A / %x41-5A)
WS = *(%x20 / %x9 / %xA / %xD)
You can read it as: a group is made of elements separated by commas. Each element has a name and maybe after it there is a sub elements group in squire brackets. A 'name' is one or more English alphabet characters. Between each part of the grammar there is permitted white space (zero or more of): space (20 hex), tab (9 hex), line feed (A hex) or carriage return (D hex). If you do not want white space, just remove the WS rule completely.
This is the syntax, but there is and semantics: if an element
has only a name, then its a value
else its an object
. This is a simple disambiguation, that you should do after the parsing is complete.
If the grammar is so simple and you do not want to code it yourself, I have made a tool (Tunnel Grammar Studio) that can generate this grammar for you in C++ code. The grammar is simple, so the demo of the tool should be enough for you. You can also run the grammar online too. TGS generates iterative parsers, meaning, that you should not worry for stack overflow problems if your groups are nesting much.
ps. Instead of the hex values you can also write 1*('a'-'z' / 'A'-'Z')
, because TGS has extensions of the ABNF syntax to make it easy to write grammars.
Upvotes: 1
Reputation: 372684
I actually think trying to parse this top-down isn't going to be so bad. If you try writing this out as a formal grammar it looks a lot worse than it actually is. For example, here's what this might look like in pseudocode:
ParseGroup():
result = empty list
while true:
read token 'next'; it's some identifier.
if (there's no next token) {
append 'next' to result.
break;
}
else if (the next token is '[') {
let inner = ParseGroup();
read token; it's a ']';
append 'next[inner]' to result.
} else {
append 'next' to result.
}
read token; it's a comma.
append ',' to result.
}
}
This is basically a slightly-modified LL(1) parser for the modified grammar.
Upvotes: 1