Reputation: 130
I'm a newbie to pyparsing: I have been reading the examples, looking around here, and trying some things out. I created a grammar and fed it a buffer. I do, however, have a heavy background in lex/yacc from the old days.
I have a general question or two.
I'm currently seeing
ParseException: Expected end of line (at char 7024), (line:213, col:2)
and then it terminates
Because of the nature of my buffer, newlines have meaning, so I did:
ParserElement.setDefaultWhitespaceChars('') # <-- zero len string
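(To illustrate what that changes, here is a toy element, not one of my real productions; with the default whitespace characters emptied, newlines are no longer skipped and have to be matched explicitly, e.g. with LineEnd():)
from pyparsing import ParserElement, Word, alphas, LineEnd
ParserElement.setDefaultWhitespaceChars('')   # must be called before the elements are defined
record = Word(alphas) + LineEnd()             # the newline now has to be matched explicitly
print(record.parseString("HELLO\n"))          # matches the word and then the explicit newline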
Does this error mean that somewhere in my productions I have a rule that is looking for a LineEnd(), and that rule happens to somehow be 'last'?
The location where it is dying is the end of the file. I tried using parseFile, but my file contains chars > ord(127), so instead I load it into memory, filter out all chars > ord(127), and then call parseString.
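Roughly, the loading step inside getTokensFromBuffer looks like this (the file name here is a placeholder):
with open('zoo.msg', 'r') as f:                                   # placeholder for the real input file
    rawBuffer = f.read()
filteredBuffer = ''.join(c for c in rawBuffer if ord(c) < 128)    # drop chars > ord(127)
tokens = self.text.parseString(filteredBuffer, parseAll=True)     # self.text is my top-level grammar element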
I tried turning on verbose_stacktrace=True for some of the elements of my grammar where I thought the problem originated.
Is there a better way to track down the exact ParserElement it is trying to recognize when an error such as this occurs? Or can I get a stack or trace of the most recently recognized productions?
I didn't realize I could edit up here... My crash is this:
[centos@new-host /tmp/sample]$ ./zooparser.py
!(zooparser.py) TEST test1: valid message type START
Ready to roll
Parsing This message: ( ignore leading>>> and trailing <<< ) >>>
ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//
<<<
Match {Group:({Group:({Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...)}}) "//"}) Group:({LineEnd "TOPIC" {Group:({[LineEnd]... Group:({"/" {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)}) | Group:({{{"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ'"}... | Group:({{"0123456789"}... ":"})} {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)})}}) | "-"}})})}... [LineEnd]... "//"})}) [Group:({LineEnd "FREE" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...)}}) "//"})]...}) [LineEnd]... StringEnd} at loc 0(1,1)
Match Group:({Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...)}}) "//"}) Group:({LineEnd "TOPIC" {Group:({[LineEnd]... Group:({"/" {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)}) | Group:({{{"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ'"}... | Group:({{"0123456789"}... ":"})} {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)})}}) | "-"}})})}... [LineEnd]... "//"})}) at loc 0(1,1)
Match Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...)}}) "//"}) at loc 0(1,1)
Exception raised:None
Exception raised:None
Exception raised:None
Traceback (most recent call last):
  File "./zooparser.py", line 319, in <module>
    test1(pgm)
  File "./zooparser.py", line 309, in test1
    test(pgm, zooMsg, 'test1: valid message type' )
  File "./zooparser.py", line 274, in test
    tokens = zg.getTokensFromBuffer(fileName)
  File "./zooparser.py", line 219, in getTokensFromBuffer
    tokens = self.text.parseString(filteredBuffer,parseAll=True)
  File "/usr/local/lib/python2.7/site-packages/pyparsing-1.5.7-py2.7.egg/pyparsing.py", line 1006, in parseString
    raise exc
pyparsing.ParseException: Expected end of line (at char 148), (line:8, col:2)
[centos@new-host /tmp/sample]$
source: see http://prj1.y23.org/zoo.zip
Upvotes: 3
Views: 5432
Reputation: 63709
pyparsing takes a different view toward parsing than lex/yacc does. You have to let the classes do some of the work. Here's an example in your code:
self.columnHeader = OneOrMore(self.aucc) \
| OneOrMore(nums) \
| OneOrMore(self.blankCharacter) \
| OneOrMore(self.specialCharacter)
You are equating OneOrMore with the '+' operator of a regex. In pyparsing, this is true for ParserElements, but at the character level, pyparsing uses the Word class:
self.columnHeader = Word(self.aucc + nums + self.blankCharacter + self.specialCharacter)
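For instance, assuming the character strings in zooparser.py look something like these placeholders (not the real definitions), Word matches the longest contiguous run of any of those characters, spaces included:
from pyparsing import Word, nums
aucc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ'"    # placeholder character sets, not the
blankCharacter = " \t"                  # actual ones from zooparser.py
specialCharacter = "!@#$%&*()-."
columnHeader = Word(aucc + nums + blankCharacter + specialCharacter)
print(columnHeader.parseString("FOOD ALLOCATION #2"))   # -> ['FOOD ALLOCATION #2']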
OneOrMore works with ParserElements, not characters. Look at OneOrMore(nums): nums is the string "0123456789", so OneOrMore(nums) will match "0123456789", "01234567890123456789", etc., but not "123". That is what Word is for. (OneOrMore will accept a string argument, but it implicitly converts it to a Literal.)
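A quick illustration of the difference (not from the original code):
from pyparsing import OneOrMore, Word, nums
print(OneOrMore(nums).parseString("0123456789"))   # -> ['0123456789']
print(Word(nums).parseString("123"))               # -> ['123']
# OneOrMore(nums).parseString("123") raises a ParseException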
This is a fundamental difference between using pyparsing and lex/yacc, and I think is the source of much of the complexity in your code.
Some other suggestions:
Your code has some premature optimizations in it - you write:
aucc = ''.join(set([alphas.upper(),"'"]))
Assuming that this will be used for defining Words, just do:
aucc = alphas.upper() + "'"
There is no harm in having duplicate characters in aucc; Word will convert them to a set internally.
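For example:
from pyparsing import Word
print(Word("AABB'").parseString("ABBA"))   # duplicates are harmless; same result as Word("AB'")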
Write a BNF for what you want to parse. It does not have to be as rigorous as it would be for lex/yacc. From your samples, it looks something like:
# sample
ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//
parser :: header topicEntry+
header :: "ZOO" sep namedValue
namedValue :: uppercaseWord sep valueBody
valueBody :: (everything up to //)
topicEntry :: topicHeader topicBody
topicHeader :: "TOPIC" sep valueBody
topicBody :: freeText
freeText :: "FREE" sep valueBody
sep :: "/"
Converting to pyparsing, this looks something like:
SEP = Literal("/")
BODY_TERMINATOR = Literal("//")
FREE_,TOPIC_,ZOO_ = map(Keyword,"FREE TOPIC ZOO".split())
uppercaseWord = Word(alphas.upper())
valueBody = SkipTo(BODY_TERMINATOR) # adjust later, but okay for now...
freeText = FREE_ + SEP + valueBody
topicBody = freeText
topicHeader = TOPIC_ + SEP + valueBody
topicEntry = topicHeader + topicBody
namedValue = uppercaseWord + SEP + valueBody
zooHeader = ZOO_ + SEP + namedValue
parser = zooHeader + OneOrMore(topicEntry)
(valueBody will have to get more elaborate when you add support for '://' embedded within a value, but save that for Round 2.)
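As a quick check against the sample above (this relies on valueBody consuming the trailing '//' as sketched, and on pyparsing's default whitespace skipping to handle the newlines):
sample = """ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//"""

print(parser.parseString(sample, parseAll=True).asList())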
Don't make things super complicated until you get at least some simple stuff working.
Upvotes: 3