I've got a generic "text block" element, for which I copied the whitespace-stripping code from the documentation:
import pyparsing as pp
text_block = pp.Group(
    pp.OneOrMore(
        pp.SkipTo(pp.LineEnd()) + pp.LineEnd().suppress(),
        stopOn=pp.StringEnd() | (pp.LineStart() + (pp.Literal("E)") | pp.Literal("F)")))
    )
).set_parse_action(pp.token_map(str.strip))
Unfortunately, this raises an error:
FAIL-EXCEPTION: TypeError: descriptor 'strip' for 'str' objects doesn't apply to a 'ParseResults' object
I replaced the use of token_map with a function:
def _strip_whitespace(tokens):
    return [token.str.strip() for token in tokens]
text_block = pp.Group(
    pp.OneOrMore(
        pp.SkipTo(pp.LineEnd()) + pp.LineEnd().suppress(),
        stopOn=pp.StringEnd() | (pp.LineStart() + (pp.Literal("E)") | pp.Literal("F)")))
    )
).set_parse_action(_strip_whitespace)
...but now it deletes all the text(!)
Upvotes: 1
Reputation: 2104
By default, pyparsing skips leading whitespace before matching each expression, so whitespace rarely ends up in the ParseResults. As the documentation puts it:
The pyparsing module handles some of the problems that are typically vexing when writing text parsers:
- extra or missing whitespace (the above program will also handle “Hello,World!”, “Hello , World !”, etc.)
- quoted strings
- embedded comments
So unless you explicitly tell it to capture whitespace, there's no need to strip anything, because the whitespace has normally been skipped already.
SkipTo() is one of the ParserElements that can capture whitespace, because it grabs everything between the current parse location and the skipped-to expression. That's why, in the table-parsing example for SkipTo, they use token_map(str.strip) in the parse action for the SkipTo.
But only the SkipTo element of the parser requires stripping, and you're trying to strip all of the tokens, which is unnecessary since, outside of such well-defined cases, there won't be any whitespace to strip. It also explains the error: the parse action is attached to the Group, so the token it receives is the nested ParseResults rather than a string, and str.strip raises the TypeError. So if you're going to apply str.strip in a parse action, it should be just on the SkipTo.
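The TypeError from the question can be reproduced without pyparsing at all, since it comes from calling the unbound str.strip on something that is not a str. This is a minimal sketch, not pyparsing's own code: a plain nested list stands in for the ParseResults that a Group hands to its parse action, and the helper name strip_strings_only is made up for illustration.

```python
# A plain nested list stands in for the grouped ParseResults (an assumption
# for illustration; the real object is a pyparsing ParseResults).
grouped_tokens = [[" This is ", " My text block "]]


def strip_strings_only(tokens):
    """Strip str tokens and recurse into nested lists instead of stripping them."""
    return [
        tok.strip() if isinstance(tok, str) else strip_strings_only(tok)
        for tok in tokens
    ]


error_message = None
try:
    # Applying str.strip to every top-level token fails on the nested list.
    [str.strip(tok) for tok in grouped_tokens]
except TypeError as exc:
    error_message = str(exc)

stripped = strip_strings_only(grouped_tokens)
print(error_message)
print(stripped)
```

Checking the token type before stripping avoids the error, but attaching the parse action to the SkipTo itself is simpler because every token it sees is already a string.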
The AtLineStart example they give is more typical for such matching, matching each line via pp.rest_of_line (which will also grab whitespace):
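# The documentation example uses the bare names, so import them directly.
from pyparsing import AtLineStart, rest_of_line

test = '''\
AAA this line
AAA and this line
  AAA but not this one
B AAA and definitely not this one
'''
# Only lines that begin with 'AAA' in column 1 are matched.
for t in (AtLineStart('AAA') + rest_of_line).search_string(test):
    print(t)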
For your text blocks, you could just grab ALL of the lines in the block by skipping to the end marker, and then deal with the stripping and splitting when you process the results:
block_start_marker = pp.AtLineStart("(E")
text_block = (
    block_start_marker.suppress()
    + pp.SkipTo(
        pp.AtLineStart("E)") | pp.AtLineStart("F)")
    )
)
You'll get a single result with the entire block (including surrounding newlines):
>>> text_block.parse_string("\n".join([
... "(E",
... " This is ",
... " My text block ",
... "E)",
... ]))
ParseResults(['\n This is \n My text block \n'], {})
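If you go this route, the stripping and splitting can happen in ordinary Python once parsing is done. Here is a minimal sketch, assuming the raw block string shown above. The helper name split_block is made up for illustration, and dropping blank lines is an assumption about what you want from the result.

```python
# The single token returned by the SkipTo-based text_block above.
raw_block = "\n This is \n My text block \n"


def split_block(raw):
    """Split a captured block into lines, strip each one, and drop empty lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


lines = split_block(raw_block)
print(lines)  # ['This is', 'My text block']
```

Keeping the grammar to a single SkipTo and post-processing the string means the parser needs no whitespace-related parse actions at all.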
Or, you can do the line-by-line parsing thing, but you'd want to apply the parse action only to the SkipTo() elements:
block_end_marker = pp.AtLineStart("E)") | pp.AtLineStart("F)")
text_block = (
    block_start_marker.suppress()
    + pp.OneOrMore(
        pp.SkipTo("\n").set_parse_action(pp.token_map(str.strip)),
        stop_on=block_end_marker
    )
)
>>> text_block.parse_string("\n".join([
... "(E",
... " This is ",
... " My text block ",
... "E)",
... ]))
ParseResults(['This is', 'My text block'], {})
Upvotes: 1