Reputation: 15615
I have data similar to YAML and need to create a grammar for it using Pyparsing. As in Python, YAML's data scope is defined by whitespace.
data:
object : object_name
comment : this object is created first
methods:
    method_name:
        input:
            arg1: arg_type
            arg2: arg2_type
        output:
    methond2_name:
        input:
        output:
            arg1 : arg_type
After parsing the above, it should output something similar to this:
{'comment': 'this object is created first',
'object': 'object_name',
'methods': {'method_name': {'input': {'arg1': 'arg_type', 'arg2': 'arg2_type'},
'output': None}, 'methond2_name': {'input': None, 'output': {'arg1': 'arg_type'}}}}
[EDIT] The data is similar to YAML but not exactly the same, so the Python YAML parser is not able to parse it. I left out some of the details to keep the example data simple.
Upvotes: 5
Views: 964
Reputation: 1069
This cannot be parsed with Pyparsing. There are a few fundamental problems. YAML cannot be parsed with a grammar parser. I will illustrate this with Python code.
It is easy to see why Python code cannot be parsed with a grammar parser. See this example:
def a():
    def b():
        def c():
            x = (3 +
            3)
The whitespace that makes up the indentation matters in Python. So if you want to parse this with a grammar, the parse tree must somehow contain this whitespace. The parse tree of the (3 + 3) expression would be something like <open-paren> 3 <add> <indent> <indent> <indent> 3 <close-paren>. But these indents have nothing to do with the + expression. The fundamental problem with a grammar parser is that it sees the spaces where they are, which is inside the expression, although semantically they don't belong to that expression. They belong to the def instead. The grammar does not see them where they are needed, and a pure grammar parser is unable to move things around, so there is no solution.
Luckily, Pyparsing has a workaround for this fundamental problem. I will explain it, again using Python code. Before the grammar is applied, Python's tokenizer turns changes in leading whitespace into special INDENT and DEDENT tokens, and it ignores indentation inside parentheses. That yields something like this:
def a():
<indent>
def b():
<indent>
def c():
<indent>
x = (3 + 3)
<dedent>
<dedent>
<dedent>
This version can now be parsed with a grammar parser.
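To see this step on the example above, here is a small sketch using Python's standard tokenize module. The helper name structural_tokens and the SOURCE constant are made up for this illustration. It only lists the INDENT/DEDENT tokens, so you can check that the continuation line inside the parentheses does not produce any.

```python
import io
import tokenize

# The nested example from above; the continuation line sits inside parentheses.
SOURCE = (
    "def a():\n"
    "    def b():\n"
    "        def c():\n"
    "            x = (3 +\n"
    "            3)\n"
)


def structural_tokens(source):
    """Return the names of the INDENT/DEDENT tokens, in order of appearance."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        tokenize.tok_name[tok.type]
        for tok in tokens
        if tok.type in (tokenize.INDENT, tokenize.DEDENT)
    ]


if __name__ == "__main__":
    # Three INDENTs for the three nested blocks, then three DEDENTs at the end.
    print(structural_tokens(SOURCE))
```

The whitespace before the second 3 never shows up as a token of its own, which is exactly why the grammar no longer has to deal with it.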
If you want to go that way in Pyparsing, you can use the IndentedBlock class (pyparsing 3; older releases provide the indentedBlock helper function instead). You can find an example here.
But there are more problems. I guess you want the following two versions to be treated as identical.
object : object_name
comment : this object is created first
methods:
    method_name:
        input:
            arg1: arg_type
            arg2: arg2_type
        output:

object : object_name
methods:
    method_name:
        input:
            arg1: arg_type
            arg2: arg2_type
        output:
comment : this object is created first
And I guess you also want the following to be illegal, because the methods block appears twice.
methods:
    method_name:
        input:
            arg1: arg_type
            arg2: arg2_type
        output:
object : object_name
methods:
    method_name:
        input:
            arg1: arg_type
            arg2: arg2_type
        output:
Assume you have grammar parsers for the methods block, the object block and the comment block. How can we combine these parsers? A grammar parser has no memory. When the first methods block is parsed, the parser must somehow memorise it, so that another methods block that comes later is not accepted.
To work around this fundamental problem, you would have to define the combination by listing the possible orderings, like this:
my_parser = ((obj + methods + comment)
             | (methods + obj + comment)
             | (comment + methods + obj)
             | (comment + obj + methods))  # ...plus the remaining orderings
You can imagine that this is not practical: the number of alternatives grows factorially with the number of blocks.
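To put a number on that, here is a rough sketch that spells out every ordering such a combined parser would need. The names BLOCKS and orderings are only for illustration, and the strings are not real Pyparsing expressions.

```python
from itertools import permutations
from math import factorial

# The three block types from the example; a real format would have more.
BLOCKS = ["object", "methods", "comment"]


def orderings(blocks):
    """Spell out every sequence alternative a fixed-order grammar would need."""
    return [" + ".join(order) for order in permutations(blocks)]


if __name__ == "__main__":
    for alternative in orderings(BLOCKS):
        print(alternative)
    # With ten block types the list would have factorial(10) entries.
    print(factorial(10))
```

Three blocks already need six alternatives, and every additional block type multiplies that count.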
If you want a fixed order (object + comment + methods), of course it makes more sense to use Pyparsing. But the keys of a YAML mapping are not ordered.
Finally, let me say that Pyparsing is made for grammar parsing. This is where Pyparsing is strong. The text in your question is not a good use case for a grammar parser. You should read this question. It describes basically the preprocessing that I mentioned. When using a grammar parser for indented text, you need this preprocessing. But once you have this preprocessing implemented, the job is mostly done; you don't need much more. Your language is indentation-based but not grammar-based. A grammar parser typically is not whitespace-sensitive, but you need something that is.
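As a rough idea of how little is left once you track indentation yourself, here is a sketch in plain Python without Pyparsing. The function names parse_indented and _empty_to_none are made up for this answer, and I am assuming one "key : value" pair per line, spaces (not tabs) for indentation, and that a key without a value and without children should become None, as in the expected output in the question.

```python
def _empty_to_none(node):
    """Turn mappings that never received children into None."""
    if isinstance(node, dict):
        return {key: _empty_to_none(value) for key, value in node.items()} or None
    return node


def parse_indented(text):
    """Parse 'key : value' lines into nested dicts, using indentation for scope."""
    root = {}
    # Each entry is (indent of the owning key, mapping that collects its children).
    stack = [(-1, root)]
    for lineno, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        key, sep, value = raw.strip().partition(":")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'key : value'")
        key, value = key.strip(), value.strip()
        # Close every block that is indented at least as far as this line.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        # This is the "memory" a pure grammar lacks: reject repeated keys.
        if key in parent:
            raise ValueError(f"line {lineno}: duplicate key {key!r}")
        if value:
            parent[key] = value
        else:
            child = {}
            parent[key] = child
            stack.append((indent, child))
    return _empty_to_none(root)


if __name__ == "__main__":
    sample = "methods:\n    method_name:\n        input:\n            arg1: arg_type\n        output:\n"
    print(parse_indented(sample))
```

Because the keys of each block live in a dict, the order of object, comment and methods does not matter, and a second methods block at the same level raises a ValueError. Anything beyond that (the details you left out of the question) would have to be added on top.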
Upvotes: 0
Reputation: 35319
Instead of Pyparsing you could use PyYAML for this.
import yaml

# safe_load avoids constructing arbitrary Python objects from the file
with open('yyy.yaml') as f:
    print(yaml.safe_load(f))
output:
{'comment': 'this object is created first',
'object': 'object_name',
'methods': {'method_name': {'input': {'arg1': 'arg_type', 'arg2': 'arg2_type'},
'output': None}, 'methond2_name': {'input': None, 'output': {'arg1': 'arg_type'}}}}
Upvotes: 3