Parsing text file in python using pyparsing

Question

I am trying to parse the following text using pyparsing.

acp (SOLO1,
     "solo-100",
     "hi here is the gift"
     "Maximum amount of money, goes",
     430, 90)

jhk (SOLO2,
     "solo-101",
     "hi here goes the wind."
     "and, they go beyond",
     1000, 320)

I have tried the following code but it doesn't work.

from pyparsing import Forward, Word, alphas, nums, nestedExpr

flag = Word(alphas+nums+'_'+'-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)

# str1 holds the text shown above
print(list(enclosed.searchString(str1)))

The commas (,) inside the quoted strings produce undesired results.

PaulMcG · Accepted Answer

Well, I might have oversimplified slightly in my comments - here is a more complete answer.

If you don't really have to deal with nested data items, then a single-level parenthesized data group in each section will look like this:

from pyparsing import (Group, OneOrMore, Suppress, Word, alphanums, alphas,
                       delimitedList, nums, quotedString)

LPAR, RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)

# treat consecutive quoted strings as one combined string
quoted_string = OneOrMore(quotedString)
# add parse action to concatenate multiple adjacent quoted strings;
# a single quoted string is passed through unchanged
quoted_string.setParseAction(
    lambda t: '"' + ''.join(s.strip('"\'') for s in t) + '"'
    if len(t) > 1 else t[0])
data_item = ident | integer | quoted_string

# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)

I wasn't sure whether you omitted the comma between the two consecutive quoted strings on purpose, so I implemented the same rule the Python compiler uses: adjacent quoted strings are treated as one longer string, so "AB CD " "EF" is the same as "AB CD EF". The definition of quoted_string matches one or more quoted strings in a row, and its parse action concatenates their contents when there are two or more.
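To make the concatenation rule concrete, here is a small standalone sketch of what the parse action does, written as a named function over a plain list of tokens. The name concat_quoted is made up for illustration and is not part of pyparsing; the ast.literal_eval check is only there to show Python's own behaviour with adjacent literals.

```python
import ast


def concat_quoted(tokens):
    """Join adjacent quoted-string tokens into one double-quoted string.

    Hypothetical helper: a named equivalent of the lambda parse action.
    A single token is returned unchanged, quotes included.
    """
    if len(tokens) == 1:
        return tokens[0]
    # strip the surrounding quote characters, join the contents, re-quote
    return '"' + "".join(tok.strip("\"'") for tok in tokens) + '"'


# Python itself joins adjacent string literals into one string
assert ast.literal_eval('"AB CD " "EF"') == "AB CD EF"

tokens = ['"hi here is the gift"', '"Maximum amount of money, goes"']
print(concat_quoted(tokens))
```

Note that nothing is inserted between the pieces, so if you want a space between "gift" and "Maximum" it has to be present in one of the original strings.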

Finally, we create a parser for the whole input, where source is the string holding your text:

results = OneOrMore(Group(section)).parseString(source)
results.pprint()

and get from your posted input sample:

[['acp',
  ['SOLO1',
   '"solo-100"',
   '"hi here is the giftMaximum amount of money, goes"',
   '430',
   '90']],
 ['jhk',
  ['SOLO2',
   '"solo-101"',
   '"hi here goes the wind.and, they go beyond"',
   '1000',
   '320']]]

If you do have nested parenthetical groups, then your section definition can be as simple as this:

# section defined with nesting
section = ident + nestedExpr()

As you have already found, though, this keeps the commas as if they were significant tokens rather than just data separators.
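If the leftover commas are a problem, one alternative is to write the recursion yourself with Forward, so that nested groups still go through delimitedList and the commas are dropped at every level. This is only a minimal sketch, assuming every nested group is a plain parenthesized, comma-separated list; the name group and the sample input are made up for illustration, and plain quotedString is used here without the concatenating parse action.

```python
from pyparsing import (Forward, Group, Suppress, Word, alphanums, alphas,
                       delimitedList, nums, quotedString)

LPAR, RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)

# 'group' is a hypothetical name: a parenthesized list whose items
# may themselves be parenthesized lists
group = Forward()
data_item = ident | integer | quotedString | group
group <<= Group(LPAR + delimitedList(data_item) + RPAR)

# section defined with nesting, commas treated as separators only
section = ident + group

sample = 'acp (SOLO1, ("solo-100", 430), "a, b", 90)'  # made-up nested input
nested_result = section.parseString(sample).asList()
print(nested_result)
```

The comma inside "a, b" survives because quotedString consumes the whole quoted string as one token before delimitedList ever looks for a separator.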
