mydriasis

Reputation: 71

Simple parser, but not a calculator

I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic-like" things.

I have a very simple DSL, for example:

ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}

Where the <TYPE> part is optional and valid only for some types.

Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:

  1. How do I look for tokens that are longer than 1 char?
  2. How do I break up the text in the different parts? For example, after a TYPE I can have a whitespace or a < or a whitespace followed by a <. How do I address that?

Upvotes: 1

Views: 657

Answers (2)

Olivier Melançon

Reputation: 22294

Short answer

All your questions boil down to the fact that you are not tokenizing your string before parsing it.

Long answer

The process of parsing is actually split in two distinct parts: lexing and parsing.

Lexing

What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking "How do I break up the text in the different parts?"

You can do it yourself by checking your string against a list of regexps using re, or you can use a well-known library such as PLY. Although, if you are using Python 3, I am biased toward a lexing-parsing library that I wrote, which is ComPyl.

So proceeding with ComPyl, the syntax you are looking for seems to be the following.

from compyl.lexer import Lexer

rules = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'), # Will match your <TYPE> token with inner whitespace
    (r'{', 'L_BRACKET'),
    (r'}', 'R_BRACKET'),
]

lexer = Lexer(rules=rules, line_rule='\n')
# See ComPyl doc to figure how to proceed from here

Notice that the first rule, (r'\s+', None), is actually what solves your issue about whitespace. It basically tells the lexer to match any whitespace characters and to ignore them. Of course, if you do not want to use a lexing tool, you can simply add a similar rule in your own re implementation.
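
If you go the plain re route, a minimal hand-rolled tokenizer could look like the sketch below. The rules mirror the ones above, with an added ';' rule since your DSL ends members with a semicolon; the tokenize name and the exact rule set are just for illustration.

import re

# (pattern, token name); None means "match and discard", like the whitespace rule above
TOKEN_RULES = [(re.compile(p), name) for p, name in [
    (r'\s+', None),
    (r'< *\w+ *>', 'TYPE'),
    (r'\w+', 'ID'),
    (r'\{', 'L_BRACKET'),
    (r'\}', 'R_BRACKET'),
    (r';', 'SEMICOLON'),
]]

def tokenize(text):
    pos = 0
    while pos < len(text):
        for regex, name in TOKEN_RULES:
            match = regex.match(text, pos)
            if match:
                if name is not None:
                    yield name, match.group()
                pos = match.end()
                break
        else:
            raise SyntaxError('Unexpected character %r at position %d' % (text[pos], pos))

source = 'ELEMENT TYPE<TYPE> elemName {\n    TYPE<TYPE> memberName;\n}'
print(list(tokenize(source)))
# [('ID', 'ELEMENT'), ('ID', 'TYPE'), ('TYPE', '<TYPE>'), ('ID', 'elemName'), ('L_BRACKET', '{'), ...]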

Parsing

You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there exist a lot of tools that can do that for you (the PLY and ComPyl libraries offer LR(1) parsers, which are more powerful but harder to hand-write; see the difference between LL(1) and LR(1) here).

Simply notice that now that you know how to tokenize your string, the issue of "How do I look for tokens that are longer than 1 char?" has been solved. You are now parsing, not a stream of characters, but a stream of tokens that encapsulate the matched words.
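
For what it is worth, here is a rough sketch of what a hand-written recursive descent parser over such a token stream could look like. It assumes (name, value) token pairs like the tokenize sketch above produces; the grammar and the dict node shapes are only guesses at your format, not a definitive implementation.

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        # Name of the next token, or None at end of input
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, name):
        # Consume the next token, which must have the given name, and return its value
        if self.peek() != name:
            raise SyntaxError('Expected %s, got %s' % (name, self.peek()))
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def parse_element(self):
        # element := ID ID [TYPE] ID '{' member* '}'
        keyword = self.expect('ID')    # e.g. 'ELEMENT'
        base = self.expect('ID')       # e.g. 'TYPE'
        type_param = self.expect('TYPE') if self.peek() == 'TYPE' else None
        name = self.expect('ID')       # e.g. 'elemName'
        self.expect('L_BRACKET')
        members = []
        while self.peek() != 'R_BRACKET':
            members.append(self.parse_member())
        self.expect('R_BRACKET')
        return {'keyword': keyword, 'base': base, 'type': type_param,
                'name': name, 'members': members}

    def parse_member(self):
        # member := ID [TYPE] ID ';'
        base = self.expect('ID')
        type_param = self.expect('TYPE') if self.peek() == 'TYPE' else None
        name = self.expect('ID')
        self.expect('SEMICOLON')
        return {'base': base, 'type': type_param, 'name': name}

print(Parser(tokenize(source)).parse_element())   # using the tokenize() sketch above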

Upvotes: 3

spookylukey

Reputation: 6576

Olivier's answer regarding lexing/tokenizing and then parsing is helpful.

However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks - there is good documentation to help.

An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser . It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
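
To give a concrete idea of what that looks like for your grammar, here is a rough parsy sketch. The lexeme helper follows the same pattern as in that example, but the grammar shape and the dict results are my own guesses at your DSL, not code taken from the parsy docs.

from parsy import generate, regex, string

whitespace = regex(r'\s*')
lexeme = lambda p: p << whitespace   # parse p, then consume any trailing whitespace

ident = lexeme(regex(r'\w+'))
type_param = lexeme(string('<') >> whitespace >> regex(r'\w+') << whitespace << string('>')).optional()
lbrace = lexeme(string('{'))
rbrace = lexeme(string('}'))
semi = lexeme(string(';'))

@generate
def member():
    # member := ident [type_param] ident ';'
    base = yield ident
    param = yield type_param
    name = yield ident
    yield semi
    return {'base': base, 'type': param, 'name': name}

@generate
def element():
    # element := ident ident [type_param] ident '{' member* '}'
    yield whitespace                 # allow leading whitespace
    keyword = yield ident            # e.g. ELEMENT
    base = yield ident               # e.g. TYPE
    param = yield type_param
    name = yield ident
    yield lbrace
    members = yield member.many()
    yield rbrace
    return {'keyword': keyword, 'base': base, 'type': param,
            'name': name, 'members': members}

print(element.parse('ELEMENT TYPE<TYPE> elemName {\n    TYPE<TYPE> memberName;\n}'))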

You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.

Upvotes: 2
