RunDeep

Reputation: 306

MatchFirst not passing to second ParseExpression?

MatchFirst does not seem to fall through to the next ParseExpression, even though the first ParseExpression should have failed.

I have a file (BOM extract from OrCAD) that has a header, lines with component info and continuation lines for part references:

(named test_string_body, tabs are used in the component section for spacing)

SCH, WACI  Revised: Wednesday, March 29, 2017
357403-01          Revision: A

Bill Of Materials          March 29,2017      17:53:04  Page1

Item    P/N Quantity    Value   PCB Footprint   Part Reference
______________________________________________

1   177347  5   100P    capc1608_is0603n    C1,C2,C3,C4,C5
2   176054  9   1.0uF   capc3216_is1206n    C6,C23,C32,C88,C95,C98,
    C99,C140,C141
3   177606  31  100P    capc1005_is0402n    C7,C8,C9,C10,C11,C12,C13,
    C14,C15,C16,C53,C56,C64,
    C69,C261,C262,C263,C268,

[Image: the sample text with whitespace made visible]

For parsing the full lines I use:

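from pyparsing import LineStart, White, Word, alphanums, nums, printables

# One full component line: item, part number, quantity, value, footprint, references
grammer_line_full = (LineStart() + Word(nums, min=1)('cmpt_item') +
                     Word(nums)('cmpt_part_num') +
                     Word(nums)('cmpt_qty') +
                     Word(printables)('cmpt_value') +
                     Word(alphanums + '_')('cmpt_footprint') +
                     Word(alphanums + ',')('cmpt_references1')
                    )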

and for the continuation lines:

grammer_line_short = White('\t', exact=5) + Word(alphanums + ',')('cmpt_references2')

If I set:

grammer_body = grammer_line_full

or I set:

grammer_body = grammer_line_short 

I get the result I am expecting (just the appropriate lines):

for match, start, stop in grammer_body.parseWithTabs().scanString(test_string_body):
    print(match)

If I set:

grammer_body = grammer_line_full | grammer_line_short

I only get the full lines.

Output with grammer_line_full alone, or with grammer_line_full | grammer_line_short:

['1', '177347', '5', '100P', 'capc1608_is0603n', 'C1,C2,C3,C4,C5']
['2', '176054', '9', '1.0uF', 'capc3216_is1206n', 'C6,C23,C32,C88,C95,C98,']
['3', '177606', '31', '100P', 'capc1005_is0402n', 'C7,C8,C9,C10,C11,C12,C13,']...

Just grammer_line_short:

['\t\t\t\t\t', 'C99,C140,C141']
['\t\t\t\t\t', 'C14,C15,C16,C53,C56,C64,']
['\t\t\t\t\t', 'C69,C261,C262,C263,C268,']...

If I delete

White('\t', exact=5) +

from grammer_line_short, it finds the continuation lines but it also matches a bunch of stuff from the header:

...
['Part']
['Reference']
['1', '177347', '5', '100P', 'capc1608_is0603n', 'C1,C2,C3,C4,C5']
['2', '176054', '9', '1.0uF', 'capc3216_is1206n', 'C6,C23,C32,C88,C95,C98,']
['C99,C140,C141']...

I added:

+ White('\t', exact=1).suppress()

to each of the elements in grammer_line_full and it didn't change anything.

I end up concatenating the continuation-line part references with the full-line values, so I think I need to parse them separately. My end goal is to parse all the header info (code not shown; I have a parser for it) and all the component info.
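The concatenation step could look roughly like the sketch below. It is plain Python, not pyparsing, and it assumes the matches have already been converted to plain lists in document order (for example with match.asList()), that a full line always yields six tokens, and that a continuation line's last token holds its references, as in the outputs shown above. The name merge_rows is hypothetical.

```python
def merge_rows(rows):
    """Fold each continuation row into the full row that precedes it."""
    components = []
    for row in rows:
        if len(row) == 6:
            # Full line: item, part number, quantity, value, footprint, references
            components.append(list(row))
        elif components:
            # Continuation line: its last token carries the extra references
            components[-1][5] += row[-1]
        # A continuation row with no preceding full row is skipped

    # Split the combined reference string; drop empties left by trailing commas
    for comp in components:
        comp[5] = [ref for ref in comp[5].split(',') if ref]
    return components


rows = [
    ['1', '177347', '5', '100P', 'capc1608_is0603n', 'C1,C2,C3,C4,C5'],
    ['2', '176054', '9', '1.0uF', 'capc3216_is1206n', 'C6,C23,C32,C88,C95,C98,'],
    ['\t\t\t\t\t', 'C99,C140,C141'],
]
merged = merge_rows(rows)
print(merged[1][5])
```

This relies on the trailing comma that the full line carries whenever a continuation follows; if the export ever omits it, the two reference strings would need to be joined with an explicit comma instead.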

I know that relying on whitespace is not preferred, but it seems like the best way to handle this kind of format, except that it doesn't work for me...

Upvotes: 1

Views: 30

Answers (1)

PaulMcG

Reputation: 63719

I suspect that the MatchFirst expression is implicitly skipping the whitespace at the beginning of the continuation lines. Try doing this (untested):

grammer_body = (grammer_line_full | grammer_line_short).leaveWhitespace()

Upvotes: 1
