Reputation: 470
I cannot figure out how to group zero or more repeating sections in text with pyparsing. In other words, I want to merge multiple matches into one named result set. Note, I want to use pyparsing as I have a lot of sections with different rules.
from pyparsing import *
input_text = """
Projects
project a created in c#
Education
university of college
Projects
project b created in python
"""
project_marker = LineStart() + Literal('Projects') + LineEnd()
education_marker = LineStart() + Literal('Education') + LineEnd()
markers = project_marker ^ education_marker
project_section = Group(
project_marker + SkipTo(markers | stringEnd).setResultsName('project')
).setResultsName('projects')
education_section = Group(
education_marker + SkipTo(markers | stringEnd).setResultsName('education')
).setResultsName('educations')
sections = project_section ^ education_section
text = StringStart() + SkipTo(sections | StringEnd())
doc = Optional(text) + ZeroOrMore(sections)
result = doc.parseString(input_text)
print(result)
# ['', ['Projects', '\n', 'project a created in c#'], ['Education', '\n', 'virginia tech'], ['Projects', '\n', 'project b created in python']]
print(result.projects)
# ['Projects', '\n', 'project b created in python']
print(result.projects[0].project)
# AttributeError: 'str' object has no attribute 'project'
Upvotes: 2
Views: 803
Reputation: 470
Thanks to @PaulMcG the solution is to add listAllMatches=True
to setResultsName
, see https://pythonhosted.org/pyparsing/pyparsing.ParserElement-class.html#setResultsName.
project_marker = LineStart() + Literal('Projects') + LineEnd()
education_marker = LineStart() + Literal('Education') + LineEnd()
markers = project_marker ^ education_marker
project_section = Group(
project_marker + SkipTo(markers | stringEnd).setResultsName('project')
).setResultsName('projects', listAllMatches=True)
education_section = Group(
education_marker + SkipTo(markers | stringEnd).setResultsName('education')
).setResultsName('educations', listAllMatches=True)
sections = project_section ^ education_section
text = StringStart() + SkipTo(sections | StringEnd())
doc = Optional(text) + ZeroOrMore(sections)
result = doc.parseString(input_text)
Upvotes: 0
Reputation: 21643
Here's my tentative answer, not that I'm proud of it. I cribbed a chunk of it from https://stackoverflow.com/a/5824309/131187.
>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])('section') + EOL + lines('lines')
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)
As you can see just below this sentence, the parser succeeds in gathering the information correctly, and in gathering it in such a way that it can be assembled, as will be shown presently. However, I cannot find a way of accessing all of the results from parseString
that are so clearly available.
I resorted to applying eval
to its repr
representation. Having done that I was able to pick out all of the pieces and assign them to a dict-like object.
To be honest, this would be easier to do without pyparsing. Read a line, note whether it's a keyword. If it is, remember it. Then until you read another keyword just place all lines you read in a dictionary under the most recent keyword.
>>> repr(r)
"(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})"
>>> evil_r = eval(repr(r))
>>> evil_r
(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})
>>> evil_r[1]['lines']
[(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})]
>>> evil_r[1]['section']
['Projects', 'Education', 'Projects']
>>> from collections import defaultdict
>>> section_info = defaultdict(list)
>>> for k, kind in enumerate(evil_r[1]['section']):
... section_info[kind].extend(evil_r[1]['lines'][k][0][:-1])
>>> for section in section_info:
... section, section_info[section]
...
('Education', ['institution 1', 'institution 2', 'institution 3', 'institution 4'])
('Projects', ['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10'])
EDIT: Or you could do this. Needs tidying up. At least it doesn't use anything unorthodox.
>>> input_text = open('temp.txt').read()
>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> from collections import defaultdict
>>> class Accum:
... def __init__(self):
... self.current_section = None
... self.result = defaultdict(list)
... def __call__(self, s):
... if s[0] in ['Projects', 'Education']:
... self.current_section = s[0]
... else:
... self.result[self.current_section].extend(s[:-1])
...
>>> accum = Accum()
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]).setParseAction(accum) + EOL + lines.setParseAction(accum)
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)
>>> accum.result['Education']
['institution 1', 'institution 2', 'institution 3', 'institution 4']
>>> accum.result['Projects']
['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10']
Upvotes: 2