Capturing block over multiple lines using pyparsing

Question

I'm trying to parse multiple selections from a multi-line document, and I want to capture all the lines between each pair of consecutive keywords. Here's an example:

Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4

I might also have

Keyword 1: CAPTURE THIS TEXT
           CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4

My code looks like

from pyparsing import *

EOL = LineEnd().suppress()
line = OneOrMore(Group(SkipTo(LineEnd()) + EOL))

KEYWORD_CAPTURE_AREA = Keyword("Keyword 1:").suppress() + line + Keyword("Keyword 2:").suppress() + line \
                    + Keyword("Keyword 3:").suppress() + line + Keyword("Keyword 4").suppress()

My current approach returns no results if the text to capture spans multiple lines. I assume there is a straightforward solution to this, but I haven't found it.

PaulMcG · Accepted Answer

The concept to learn with pyparsing is that each sub-expression runs on its own, unaware of any containing or following expressions. Your line expression is defined as one or more repetitions of "skip to the end of the current line", so it has no way of knowing that it should stop when it sees the next "Keyword" string, and it predictably reads to the end of the input. When the parser then moves on to look for "Keyword 2:", it is already well past that point, and so it raises an exception.

You need to tell OneOrMore that it should stop parsing if it finds a "Keyword" at the beginning of a line, even if that would ordinarily match the repeating expression. A reasonable detection of the end of a block might be the word "Keyword" if found at the beginning of a line. (You could make it more detailed and match "Keyword" + integer + ":" to make this really bulletproof.) Let's call this "start_of_block_marker":

start_of_block_marker = LineStart() + "Keyword"
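As a rough sketch of the more detailed marker suggested above: the names detailed_block_marker and is_block_start are made up for illustration, and the trailing colon is treated as optional on the assumption that markers like "Keyword 4" in the sample text may appear without one.

```python
from pyparsing import LineStart, Literal, Optional, ParseException, Word, nums

# Hypothetical stricter marker: "Keyword", an integer, then an optional colon.
# The colon is optional only because "Keyword 4" in the sample has none.
detailed_block_marker = (
    LineStart() + Literal("Keyword") + Word(nums) + Optional(Literal(":"))
)


def is_block_start(text):
    """Return True if text begins with a marker such as 'Keyword 3:'."""
    try:
        detailed_block_marker.parseString(text)
    except ParseException:
        return False
    return True


print(is_block_start("Keyword 3: CAPTURE THIS TEXT"))  # True
print(is_block_start("Keyword 4"))                     # True
print(is_block_start("Keyword notes"))                 # False, no integer
print(is_block_start("CAPTURE THIS TEXT"))             # False
```

The stricter marker avoids ending a block early when a captured line merely happens to start with the word "Keyword".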

To tell OneOrMore that this indicates a stop condition for its repetition, pass this expression as the stopOn argument:

line = OneOrMore(Group(SkipTo(LineEnd()) + EOL), 
                 stopOn=LineStart() + "Keyword")
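The effect of stopOn is easiest to see on a one-line input. The following is a minimal sketch using a made-up grammar (alphabetic words followed by the literal "end") rather than the keywords from the question; the helper try_parse is hypothetical and only there to turn a parse failure into None.

```python
from pyparsing import Literal, OneOrMore, ParseException, Word, alphas

word = Word(alphas)

# Without stopOn, the repetition also swallows "end", so the trailing
# Literal("end") has nothing left to match.
greedy = OneOrMore(word) + Literal("end")

# With stopOn, the repetition checks for "end" before each word and stops there.
bounded = OneOrMore(word, stopOn=Literal("end")) + Literal("end")


def try_parse(expr, text):
    """Return the parsed tokens as a list, or None if parsing fails."""
    try:
        return expr.parseString(text).asList()
    except ParseException:
        return None


greedy_result = try_parse(greedy, "alpha beta end")
bounded_result = try_parse(bounded, "alpha beta end")

print(greedy_result)   # None
print(bounded_result)  # ['alpha', 'beta', 'end']
```

This is the same failure mode as in the question: the repeating expression consumes the text that the next expression was supposed to match.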

Now this will parse all your strings, but you are grouping within the OneOrMore, when I think you really want all the substrings in a single group. Also, the blank line between 2 and 3 creates an extra empty line. Here is an improved version of line:

line = Optional(EOL) + Group(OneOrMore(SkipTo(LineEnd()) + EOL,
                             stopOn=LineStart() + "Keyword"))

I put your two test strings in a list, and then use it as an argument to runTests():

text1 = """\
Keyword 1: CAPTURE THIS TEXT
           CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4"""

text2 = """\
Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4
"""

tests = [text1, text2]
KEYWORD_CAPTURE_AREA.runTests(tests)

Which prints (echoing each test, and then printing the parsed results):

Keyword 1: CAPTURE THIS TEXT
           CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4
[['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']]
[0]:
  ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']
[1]:
  ['CAPTURE THIS TEXT']
[2]:
  ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']


Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4

[['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']]
[0]:
  ['CAPTURE THIS TEXT']
[1]:
  ['CAPTURE THIS TEXT']
[2]:
  ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']

If a test string fails to parse, runTests() displays the problem line and location, along with the pyparsing error message.
