rtsrsquared

Reputation: 23

Python: how to do this complex multiline regex involving escapes?

I have a file that looks like this:

...

- family:
  - home: house
    location: 53rd street|Austin|Texas|U.S
    type: old
original entry: '544'
  issues:
  - plumbing: fixed
    ref:
    - id: 28
      cost: 23 USD

- family:
  - home: house
    location: 53rd street|Austin|Texas|U.S
    type: old
original entry: '545'
  issues:
  - plumbing: fixed
    ref:
    - id: 1081
      cost: 33 USD

 ...

This file has hundreds of similar entries on other families.

I want to make it look like this:

- family:
  - home: house
    location: 53rd street|Austin|Texas|U.S
    type: old
original entry: '544'
  issues:
  - plumbing: fixed
    ref:
    - id: 28
      cost: 23 USD
    - id: 1081
      cost: 33 USD

I have tried making a multiline regex where I just find the text in the middle and replace it with nothing. Here is the pattern I attempted:

pattern = "r'\s- family:\n\s+- home: house\n\s+tag: 53rd street|Austin|Texas|U.S\n\s+type: old\n\original entry: \'554\'\n\s+issues:\n\s+- plumbing: fixed\n\s+ref:'"

This did not seem to work. I tried one of those online regex tools that suggested:

pattern = "r'\s- family:\n\s+- home: house\n\s+tag: 53rd street\|Austin\|Texas\|U.S\n\s+type: old\n\original entry: '554'\n\s+issues:\n\s+- plumbing: fixed\n\s+ref:'"

This also did not appear to work. I have used my multiline regex function on simpler cases without a problem, so I know the regex code itself works. It is just that it seems a bit tricky getting a pattern that works.

I figure there must be some stuff that is not getting escaped correctly, or escaped too much. Also, this strategy does not seem to get both of the original entry numbers after each other.

Is there a way this can be done? I guess one could just use the entire two blocks as the pattern and the desired result as the replacement text, but that seems even bulkier and more difficult...

Upvotes: 0

Views: 47

Answers (1)

Bill Bell

Reputation: 21663

The parser for doing this with pyparsing is uncomplicated. Here it is bound to the name p. Each line is defined as everything up to a newline, followed by that newline, and the entire file consists of OneOrMore of these. Since pyparsing skips whitespace by default, the empty lines disappear, along with each line's leading indentation.

>>> import pyparsing as pp
>>> theFile = open('temp.txt').read()
>>> p = pp.OneOrMore(pp.Combine(pp.restOfLine+pp.Suppress('\n')))
>>> for item in p.parseString(theFile):
...     item
... 
'- family:'
'- home: house'
'location: 53rd street|Austin|Texas|U.S'
'type: old'
"original entry: '544'"
'issues:'
'- plumbing: fixed'
'ref:'
'- id: 28'
'cost: 23 USD'
'- family:'
'- home: house'
'location: 53rd street|Austin|Texas|U.S'
'type: old'
"original entry: '545'"
'issues:'
'- plumbing: fixed'
'ref:'
'- id: 1081'
'cost: 33 USD'
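
Getting from there to the merged layout in the question still needs a grouping step. Here is a minimal sketch of one way to do it with plain string handling instead of pyparsing, which also avoids the regex-escaping problem entirely. It rests on assumptions taken from the sample: entries are separated by blank lines, each entry has a single `ref:` line, and two neighbouring entries belong together when everything up to `ref:` is identical apart from the `original entry` line. The name `merge_entries` is made up for this example.

```python
import re


def merge_entries(text):
    """Merge neighbouring blocks that share a header, ignoring 'original entry'."""
    # Assumption: blocks are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip("\n")) if b.strip()]
    merged = []  # each item: [key, header_lines, ref_lines]
    for block in blocks:
        lines = block.splitlines()
        # Everything up to and including the 'ref:' line counts as the header.
        ref_at = next((i for i, ln in enumerate(lines) if ln.strip() == "ref:"), None)
        if ref_at is None:
            merged.append([None, lines, []])  # no 'ref:' line: keep block untouched
            continue
        header, refs = lines[: ref_at + 1], lines[ref_at + 1 :]
        # Compare headers without the 'original entry' line, which differs per block.
        key = tuple(ln for ln in header if not ln.strip().startswith("original entry:"))
        if merged and merged[-1][0] == key:
            merged[-1][2].extend(refs)  # same family as previous block: append refs
        else:
            merged.append([key, header, refs])
    return "\n\n".join("\n".join(h + r) for _, h, r in merged) + "\n"
```

Calling `merge_entries(open('temp.txt').read())` returns the merged text. The first block's header is the one that survives, so `original entry: '544'` is kept and `'545'` is dropped, as in the desired output. Only blocks that are next to each other are merged; grouping matching families scattered through the file would need a dictionary keyed on the header instead.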

Upvotes: 1
