carrieje

Reputation: 495

Pyparsing: whitespace sometimes matters... sometimes doesn't

I would like to create a grammar for a file that contains several sections (like PARAGRAPH below).

A section starts with its keyword (e.g. PARAGRAPH), is followed by a header (title here) and has its contents on the following lines. Each line of content is a row of the section. In effect, it is like a table with a header, columns and rows.

In the example below (tablefile), I will limit the sections to have one column and one line.

Top-Down BNF of Tablefile:

tablefile := paragraph*
paragraph := PARAGRAPH title CR
             TAB content
title, content := \w+

Pyparsing grammar :

As I need line breaks and tabulation to be handled, I need to set the default whitespace characters to ' '.

from pyparsing import LineEnd, OneOrMore, ParserElement, White, Word, alphas

def grammar():
    '''
    Bottom-up grammar definition
    '''

    ParserElement.setDefaultWhitespaceChars(' ')
    TAB = White("\t").suppress()
    CR = LineEnd().setName("Carriage Return").suppress()
    PARAGRAPH = 'PARAGRAPH'

    title = Word(alphas)
    content = Word(alphas)
    paragraph = (PARAGRAPH + title + CR
                 + TAB + content)

    tablefile = OneOrMore(paragraph)
    tablefile.parseWithTabs()

    return tablefile

Applying to examples

This dummy example matches easily :

PARAGRAPH someTitle
          thisIsContent

This other one does not:

PARAGRAPH someTitle
          thisIsContent
PARAGRAPH otherTitle
          thisIsOtherContent

The parser expects PARAGRAPH right after the first content, but stumbles upon a line break instead (remember setDefaultWhitespaceChars(' ')). Am I compelled to add CR? at the end of a paragraph? What would be a better way to ignore such trailing line breaks?

Also, I would like to allow tabs and spaces anywhere in the file without disturbing the parse. The only required behavior is that a paragraph's content line starts with TAB and that PARAGRAPH starts its line. That would also mean skipping blank lines (containing tabs and spaces, or nothing) in and between paragraphs.

Thus I added this line :

tablefile.ignore(LineStart() + ZeroOrMore(White(' \t')) + LineEnd())

But every requirement I have just listed seems to conflict with my need to set the default whitespace characters to ' ', which leaves me at a dead end.

Indeed, this would cause everything to break down :

tablefile.ignore(CR)
tablefile.ignore(TAB)

Glue PARAGRAPH and TAB to the start of line

If I want \t to be ignored anywhere in the text except at the start of lines, I have to add it to the default whitespace characters.

I then found a way to forbid any whitespace character at the start of a line: the leaveWhitespace method. It stops an element from skipping the whitespace that precedes it, so the token must match exactly where the parser currently stands. Hence, I can glue some tokens to the start of the line.

ParserElement.setDefaultWhitespaceChars('\t ')
SOL = LineStart().suppress()
EOL = LineEnd().suppress()

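# Word() needs a character set; alphanums is an assumption (the samples contain digits)
title = Word(alphanums)
content = Word(alphanums)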
PARAGRAPH = Keyword('PARAGRAPH').leaveWhitespace()
TAB = Literal('\t').leaveWhitespace()

paragraph = (SOL + PARAGRAPH + title + EOL
             + SOL + TAB + content + EOL)

This solved my problem of allowing TABs anywhere in the text.
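For reference, the approach above can be put together as a small runnable sketch. It is only an illustration under a few assumptions: title and content are taken to be alphanumeric words, LineStart is left out because leaveWhitespace on the first token of each line already rejects indentation, and every paragraph is assumed to end with a line break. The names NL and parse_tablefile are hypothetical.

```python
from pyparsing import (Group, Keyword, LineEnd, Literal, ParserElement,
                       Word, ZeroOrMore, alphanums)

# Spaces and tabs are skippable; line breaks stay significant.
# Note: this call changes a global default for elements created afterwards.
ParserElement.setDefaultWhitespaceChars(' \t')

NL = LineEnd().suppress()
# leaveWhitespace() disables whitespace skipping before these two tokens,
# so in this grammar each of them must sit at the very start of its line.
PARAGRAPH = Keyword('PARAGRAPH').leaveWhitespace().suppress()
TAB = Literal('\t').leaveWhitespace().suppress()

# Assumption: titles and contents are alphanumeric words.
title = Word(alphanums)
content = Word(alphanums)

paragraph = Group(PARAGRAPH + title + NL + TAB + content + NL)
tablefile = ZeroOrMore(paragraph)
tablefile.parseWithTabs()  # keep tabs instead of expanding them to spaces


def parse_tablefile(text):
    """Return a list of [title, content] pairs (hypothetical helper)."""
    return tablefile.parseString(text, parseAll=True).asList()


sample = "PARAGRAPH titleone\n\tcontent1\nPARAGRAPH \t titletwo\n\tcontent2  \n"
print(parse_tablefile(sample))
```

Extra spaces and tabs after the keyword or after the content are skipped, while a PARAGRAPH keyword preceded by whitespace is rejected.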

Separating paragraphs

After a bit of thinking, I arrived at PaulMcGuire's solution (delimitedList), and I ran into an issue with it.

Here are two different ways of declaring line break separators between two paragraphs. In my opinion, they should be equivalent. In practice, they are not.

Crash test (don't forget to replace the leading spaces with tabs if you run it):

PARAGRAPH titleone
          content1
PARAGRAPH titletwo
          content2

Common part between the two examples :

ParserElement.setDefaultWhitespaceChars('\t ')
SOL = LineStart().suppress()
EOL = LineEnd().suppress()

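# Word() needs a character set; alphanums is an assumption (the samples contain digits)
title = Word(alphanums)
content = Word(alphanums)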
PARAGRAPH = Keyword('PARAGRAPH').leaveWhitespace()
TAB = Literal('\t').leaveWhitespace()

First example, working one :

paragraph = (SOL + PARAGRAPH + title + EOL
            + SOL + TAB + content + EOL)

tablefile = ZeroOrMore(paragraph)

Second example, not working :

paragraph = (SOL + PARAGRAPH + title + EOL
            + SOL + TAB + content)

tablefile = delimitedList(paragraph, delim=EOL)

Shouldn't they be equivalent? The second one raises this exception:

Expected end of text (at char 66), (line:4, col:1)

It is not a big issue for me, as I can fall back to putting EOL at the end of every paragraph-like section of my grammar. But I wanted to highlight this point.

Ignoring blank line containing white spaces

Another requirement I had was to ignore blank lines, including those containing whitespace (' \t').

A simple grammar for this would be :

ParserElement.setDefaultWhitespaceChars(' \t')
SOL = LineStart().suppress()
EOL = LineEnd().suppress()

word = Word('a')
entry = SOL + word + EOL

grammar = ZeroOrMore(entry)
grammar.ignore(SOL + EOL)

At the end, the file can contain one word per line, with any whitespace anywhere. And it should ignore blank lines.

Happily, it does ignore empty lines. However, the ignore expression is not affected by the default whitespace declaration: a blank line containing spaces or tabs causes the parser to raise a parsing exception.

This is not at all the behavior I expected. Is it the specified behavior? Is there a bug hiding in this simple attempt?

I can see in this thread that PaulMcGuire did not try to ignore blank lines but tokenized them instead, in a makefile-like grammar parser (NL = LineEnd().suppress()).

Any python module for customized BNF parser?

makefile_parser = ZeroOrMore( symbol_assignment
                             | task_definition
                             | NL )
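Applied to the one-word-per-line grammar above, that idea might look like the sketch below. It treats a blank line as one more alternative instead of calling ignore(), and it drops LineStart; both choices are assumptions of this sketch, not something taken from the linked thread. Because LineEnd skips the default whitespace characters before matching, a line made only of spaces or tabs is consumed by NL as well.

```python
from pyparsing import LineEnd, ParserElement, Word, ZeroOrMore

# Spaces and tabs are skippable; the line break itself is a token.
ParserElement.setDefaultWhitespaceChars(' \t')

NL = LineEnd().suppress()
word = Word('a')          # same toy token as above: runs of the letter "a"
entry = word + NL

# A line is either an entry or a (possibly whitespace-only) blank line.
grammar = ZeroOrMore(entry | NL)
grammar.parseWithTabs()

text = "a\n  \t \naa\n\n a \n"
print(grammar.parseString(text, parseAll=True).asList())
```

With this variant no preprocessing pass is needed, at the cost of mentioning NL explicitly wherever blank lines are allowed.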

The only solution I have for now is to preprocess the file and strip the whitespace from blank lines, as pyparsing correctly ignores blank lines with no whitespace in them.

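import tempfile

# os.tmpfile() only exists in Python 2; tempfile.TemporaryFile works in Python 3
preprocessed_file = tempfile.TemporaryFile('w+')
with open(filename, 'r') as source:
    for line in source:
        # rstrip() only removes trailing whitespace, so the leading TAB
        # of a paragraph content line is preserved
        preprocessed_file.write(line.rstrip() + '\n')
preprocessed_file.seek(0)

grammar.parseFile(preprocessed_file, parseAll=True)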

Upvotes: 37

Views: 5136

Answers (1)

Cees Timmerman

Reputation: 19644

Your BNF mentions only CR, but your parser treats LF as the line terminator. Is that intended? BNF can express LF (Unix), CR (classic Mac) and CRLF (Windows) line endings:

Rule | Def.  | Meaning
CR   | %x0D  | carriage return
LF   | %x0A  | linefeed
CRLF | CR LF | Internet standard newline
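If the input may use CR or CRLF line endings, one option is to normalize them to LF before parsing, since pyparsing's LineEnd matches the "\n" character. A minimal sketch using only the standard library; the helper name normalize_newlines is hypothetical:

```python
def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so that its CR is not converted a second time.
    return text.replace('\r\n', '\n').replace('\r', '\n')


mixed = "PARAGRAPH one\r\n\tfirst\rPARAGRAPH two\n\tsecond\n"
print(repr(normalize_newlines(mixed)))
```

Files opened in text mode with Python 3's default newline handling already get this translation, so the helper mainly matters for strings that come from elsewhere.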

Upvotes: 2
