Reputation: 1167
I'm trying to parse multiple selections from a multi-line document and want to capture all the lines between consecutive keywords. Here's an example:
Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT
Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 4
I might also have
Keyword 1: CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT
Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 4
My code looks like
from pyparsing import *
EOL = LineEnd().suppress()
line = OneOrMore(Group(SkipTo(LineEnd()) + EOL))
KEYWORD_CAPTURE_AREA = Keyword("Keyword 1:").suppress() + line + Keyword("Keyword 2:").suppress() + line \
+ Keyword("Keyword 3:").suppress() + line + Keyword("Keyword 4").suppress()
My current approach returns no results when the text to capture spans multiple lines. I assume there is a straightforward solution to this; I just haven't found it.
Upvotes: 2
Views: 1130
Reputation: 63762
The concept to learn with pyparsing is that each sub-expression runs on its own, unaware of any containing or following expressions. So when your line expression is defined to match one or more "skip to the end of the current line", it doesn't know that it should stop when it sees the next "Keyword" string, and so it predictably reads to the end of the string. Then when the parser moves on to look for "Keyword 2:", it is already well past that point, and so raises an exception.
You need to tell OneOrMore that it should stop parsing if it finds a "Keyword" at the beginning of a line, even if that would ordinarily match the repeating expression. A reasonable detection of the end of a block might be the word "Keyword" if found at the beginning of a line. (You could make it more detailed and match "Keyword" + integer + ":" to make this really bulletproof.) Let's call this "start_of_block_marker":
start_of_block_marker = LineStart() + "Keyword"
To tell OneOrMore that this indicates a stop condition for its repetition, pass this expression as the stopOn argument:
line = OneOrMore(Group(SkipTo(LineEnd()) + EOL),
                 stopOn=LineStart() + "Keyword")
Now this will parse all your strings, but you are grouping within the OneOrMore, when I think you really want all the substrings in a single group. Also, the blank line between 2 and 3 creates an extra empty line. Here is an improved version of line:
line = Optional(EOL) + Group(OneOrMore(SkipTo(LineEnd()) + EOL,
                                       stopOn=LineStart() + "Keyword"))
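The effect of a stop condition can be imitated in plain Python, without pyparsing. The sketch below is only an analogy: take_lines and its stop_on parameter are hypothetical names invented for illustration, and this is not how pyparsing works internally. It shows a repetition that swallows the next keyword line when left unbounded, and one that halts in front of it when given a stop test.

```python
def take_lines(lines, start, stop_on=None):
    """Consume lines from index `start` until the input ends or stop_on(line) is true.

    Hypothetical helper for illustration; returns (captured_lines, next_index).
    """
    captured = []
    i = start
    while i < len(lines):
        # Check the stop condition before consuming, like a lookahead.
        if stop_on is not None and stop_on(lines[i]):
            break
        captured.append(lines[i])
        i += 1
    return captured, i


lines = [
    "CAPTURE THIS TEXT",
    "CAPTURE THIS TEXT",
    "Keyword 2: CAPTURE THIS TEXT",
]

# Unbounded repetition: every line matches "skip to end of line", so the
# "Keyword 2:" line is consumed too and nothing is left for the next step.
greedy, greedy_next = take_lines(lines, 0)

# Bounded repetition: stop as soon as a line begins with "Keyword".
bounded, bounded_next = take_lines(
    lines, 0, stop_on=lambda s: s.startswith("Keyword")
)

print(greedy_next, bounded_next)  # 3 2
print(bounded)                    # ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']
```

In the bounded case the next index still points at the "Keyword 2:" line, so a following matcher can pick it up, which is the role stopOn plays for OneOrMore above.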
I put your two test strings in a list, and then use it as an argument to runTests():
text1 = """\
Keyword 1: CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT
Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 4"""
text2 = """\
Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT
Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 4
"""
tests = [text1, text2]
KEYWORD_CAPTURE_AREA.runTests(tests)
Which prints (echoing each test, and then printing the parsed results):
Keyword 1: CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT
Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 4
[['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']]
[0]:
['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']
[1]:
['CAPTURE THIS TEXT']
[2]:
['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']
Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT
Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 4
[['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']]
[0]:
['CAPTURE THIS TEXT']
[1]:
['CAPTURE THIS TEXT']
[2]:
['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']
If there is an error in the results, runTests() will display the problem line and location, and give the pyparsing error message.
Upvotes: 3
Reputation: 2692
Does it have to be pyparsing?
If not, you could use split, e.g.
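with open('sample.txt') as f:  # the file is closed automatically
    content = f.read()
values = []
# Split on the keyword prefix, then drop the two-character "N:" label
for text in content.split('Keyword '):
    values.append(text[2:])
print(values)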
>> ['', ' CAPTURE THIS TEXT\n CAPTURE THIS TEXT\n', ' CAPTURE THIS TEXT\n\n', '\nCAPTURE THIS TEXT\nCAPTURE THIS TEXT\nCAPTURE THIS TEXT\nCAPTURE THIS TEXT\n\n', '']
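The split output above still contains empty strings and stray whitespace. A possible refinement, sketched here with the standard re module, splits only on "Keyword <number>" at the start of a line and keeps the non-empty lines of each chunk. The sample text is inlined instead of read from sample.txt, and the names marker and blocks are made up for this example.

```python
import re

text = """\
Keyword 1: CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT
Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
Keyword 4
"""

# "Keyword", a space, digits, and an optional colon, anchored at a line start.
marker = re.compile(r"^Keyword \d+:?", flags=re.MULTILINE)

blocks = []
for chunk in marker.split(text):
    # Strip each line and discard the blank ones.
    kept = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    if kept:  # skip chunks with no text (before the first and after the last keyword)
        blocks.append(kept)

print(blocks)
```

One caveat of this sketch: a keyword with no text after it is dropped entirely, so the position of a block in the list does not reliably tell you which keyword it belonged to.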
Upvotes: 0