Cory

Reputation: 15615

How to create a grammar for the following data using Pyparsing

I have data similar to YAML and need to create a grammar for it using Pyparsing. As in Python, scope in YAML is defined by whitespace (indentation).

data:

object : object_name 
comment : this object is created first 
methods:   
  method_name:
    input: 
      arg1: arg_type
      arg2: arg2_type
    output:   

  methond2_name:
    input:
    output:
      arg1 : arg_type

After parsing the above, it should output something similar to this:

{'comment': 'this object is created first',
 'object': 'object_name',
 'methods': {'method_name': {'input': {'arg1': 'arg_type', 'arg2': 'arg2_type'}, 
 'output': None}, 'methond2_name': {'input': None, 'output': {'arg1': 'arg_type'}}}}

[EDIT] The data is similar to YAML but not exactly the same, so a Python YAML parser is not able to parse it. I left out some of the details to make the example data simpler.

Upvotes: 5

Views: 964

Answers (2)

habrewning

Reputation: 1069

This cannot be parsed with Pyparsing. There are a few fundamental problems: YAML cannot be parsed with a pure grammar parser. I will illustrate this with Python code.

It is easy to see why Python code cannot be parsed with a grammar parser. See this example:

def a():
  def b():
    def c():
      x = (3 +
          3)

The whitespace that makes up the indentation matters in Python. So if you want to parse this with a grammar, the parse tree must somehow contain that whitespace. The parse tree of the (3 + 3) expression would be something like <open-paren> 3 <add> <indent> <indent> <indent> 3 <close-paren>. But these indents have nothing to do with the + expression. The fundamental problem for a grammar parser is that it sees the spaces where they occur, which is inside expressions, while semantically they do not belong to those expressions; they belong to the def. The grammar does not see them where they are needed, and a pure grammar parser cannot move things around, so there is no solution.

Luckily Pyparsing has a workaround for this fundamental problem. I will explain it with Python code again. When the Python compiler processes source code, it first applies a preprocessing step that goes beyond ordinary lexing: it replaces the leading whitespace with special symbols (CPython's tokenizer emits INDENT and DEDENT tokens for this). That yields something like this:

def a():
<indent>
def b():
<indent>
def c():
<indent>
x = (3 + 3)
<dedent>
<dedent>
<dedent>

This version can now be parsed with a grammar parser.
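A minimal sketch of that kind of preprocessing in plain Python (standard library only; this is neither Pyparsing nor what CPython literally does). The function name mark_indentation and the <indent>/<dedent> marker strings are made up for illustration, and the sketch assumes space-only indentation and no continuation lines inside brackets.

```python
def mark_indentation(text):
    """Replace leading spaces with explicit <indent>/<dedent> marker lines."""
    out = []
    stack = [0]  # indentation widths of the currently open blocks
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines carry no structure
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            out.append("<indent>")
        while width < stack[-1]:
            # Shallower: close blocks until we are back at a known level.
            stack.pop()
            out.append("<dedent>")
        if width != stack[-1]:
            raise ValueError(f"inconsistent indentation: {raw!r}")
        out.append(raw.strip())
    # Close whatever is still open at the end of the input.
    out.extend("<dedent>" for _ in stack[1:])
    return out


sample = """\
methods:
  method_name:
    input:
      arg1: arg_type
    output:
"""
print("\n".join(mark_indentation(sample)))
```

After this step the whitespace is gone from the lines themselves, and the block structure is carried only by the marker lines.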

If you want to go that way in Pyparsing, you can use the IndentedBlock class. You can find an example here.

But there are more problems. I guess you want the following two versions to be treated as identical.

object : object_name 
comment : this object is created first 
methods:   
  method_name:
    input: 
      arg1: arg_type
      arg2: arg2_type
    output:   

object : object_name 
methods:   
  method_name:
    input: 
      arg1: arg_type
      arg2: arg2_type
    output:   
comment : this object is created first 

And I guess you also want the following to be illegal, because the methods block appears twice.

methods:   
  method_name:
    input: 
      arg1: arg_type
      arg2: arg2_type
    output:   
object : object_name 
methods:   
  method_name:
    input: 
      arg1: arg_type
      arg2: arg2_type
    output:   

Assume you have grammar parsers for the methods block, the object block and the comment block. How can we combine these parsers? A grammar parser has no memory. When the first methods block is parsed, the parser would somehow have to remember it, so that another methods block that comes later is not accepted.

To work around this fundamental problem, you would have to spell out every allowed ordering, like this:

my_parser = ((object + methods + comment)
             | (object + comment + methods)
             | (methods + object + comment)
             | (methods + comment + object)
             | (comment + object + methods)
             | (comment + methods + object))

You can imagine that this is not practicable once there are more than a few blocks.
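To put a number on it, and to show what the missing "memory" could look like outside the grammar, here is a small sketch in plain Python. The helper check_top_level and the list of block names are assumptions for illustration only; this is not a Pyparsing API.

```python
from itertools import permutations

blocks = ["object", "methods", "comment"]

# Spelling out every ordering as its own alternative grows factorially:
# 3 blocks need 6 alternatives, 6 blocks would already need 720.
orderings = list(permutations(blocks))
print(len(orderings))


def check_top_level(keys, allowed=frozenset(blocks)):
    """Accept the blocks in any order, but reject unknown or repeated ones."""
    seen = set()  # the "memory" that a plain alternation lacks
    for key in keys:
        if key not in allowed:
            raise ValueError(f"unknown block {key!r}")
        if key in seen:
            raise ValueError(f"block {key!r} appears twice")
        seen.add(key)
    return seen


check_top_level(["comment", "object", "methods"])    # any order is fine
# check_top_level(["methods", "object", "methods"])  # would raise ValueError
```

The check runs over the block names after they have been recognised, so the order does not have to be encoded in the grammar at all.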

If you want a fixed order (object + comment + methods), then of course it makes more sense to use Pyparsing. But the keys of a YAML mapping are not ordered.

Finally, let me say that Pyparsing is made for grammar parsing; this is where it is strong. The text in the question is not a good use case for a grammar parser. You should read this question. That is basically the preprocessing I mentioned. When using a grammar parser for indented text, you need this preprocessing. But once you have implemented it, the job is already mostly done; you don't need much more. Your language is indentation-based rather than grammar-based. A grammar parser is typically not whitespace-sensitive, but you need something that is.
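As a rough illustration of how little is left once indentation is tracked, here is a sketch in plain Python that turns the data from the question into nested dicts. The function names are made up, and the sketch assumes space-only indentation, one "key : value" pair per line, and no validation beyond that (an odd dedent width or a repeated key is not reported).

```python
from pprint import pprint


def parse_indented(text):
    """Parse 'key : value' lines into nested dicts, using indentation as scope."""
    root = {}
    # Each entry: (indent width of the key that opened the block, its dict).
    stack = [(-1, root)]
    for raw in text.splitlines():
        if not raw.strip():
            continue
        width = len(raw) - len(raw.lstrip(" "))
        key, sep, value = raw.strip().partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {raw!r}")
        key, value = key.strip(), value.strip()
        # Leave every block whose opening key is indented at least this deep.
        while stack[-1][0] >= width:
            stack.pop()
        parent = stack[-1][1]
        if value:
            parent[key] = value
        else:
            parent[key] = {}  # a block header; children may follow
            stack.append((width, parent[key]))
    return _empty_to_none(root)


def _empty_to_none(node):
    """Replace block headers that never got children with None."""
    for key, value in list(node.items()):
        if isinstance(value, dict):
            node[key] = _empty_to_none(value) if value else None
    return node


data = """\
object : object_name
comment : this object is created first
methods:
  method_name:
    input:
      arg1: arg_type
      arg2: arg2_type
    output:

  methond2_name:
    input:
    output:
      arg1 : arg_type
"""
result = parse_indented(data)
pprint(result)
```

The stack plays the role of the indent/dedent symbols: pushing corresponds to an indent, popping to a dedent.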

Upvotes: 0

fraxel

Reputation: 35319

Instead of Pyparsing you could use PyYAML for this.

import yaml

# safe_load avoids constructing arbitrary Python objects from the file
with open('yyy.yaml') as f:
    print(yaml.safe_load(f))

output:

{'comment': 'this object is created first',
 'object': 'object_name',
 'methods': {'method_name': {'input': {'arg1': 'arg_type', 'arg2': 'arg2_type'}, 
 'output': None}, 'methond2_name': {'input': None, 'output': {'arg1': 'arg_type'}}}}

Upvotes: 3
