Reputation: 23
I have a file that looks like this:
...
- family:
- home: house
location: 53rd street|Austin|Texas|U.S
type: old
original entry: '544'
issues:
- plumbing: fixed
ref:
- id: 28
cost: 23 USD
- family:
- home: house
location: 53rd street|Austin|Texas|U.S
type: old
original entry: '545'
issues:
- plumbing: fixed
ref:
- id: 1081
cost: 33 USD
...
This file has hundreds of similar entries on other families.
I want to make it look like this:
- family:
- home: house
location: 53rd street|Austin|Texas|U.S
type: old
original entry: '544'
issues:
- plumbing: fixed
ref:
- id: 28
cost: 23 USD
- id: 1081
cost: 33 USD
I have tried making a multiline regex where I just find the text in the middle and replace it with nothing. Here is the pattern I attempted:
pattern = "r'\s- family:\n\s+- home: house\n\s+tag: 53rd street|Austin|Texas|U.S\n\s+type: old\n\original entry: \'554\'\n\s+issues:\n\s+- plumbing: fixed\n\s+ref:'"
This did not seem to work. I tried one of those online regex tools that suggested:
pattern = "r'\s- family:\n\s+- home: house\n\s+tag: 53rd street\|Austin\|Texas\|U.S\n\s+type: old\n\original entry: '554'\n\s+issues:\n\s+- plumbing: fixed\n\s+ref:'"
This also did not appear to work. I have used my multiline regex function on simpler cases without a problem, so I know the regex code itself works. It is just that it seems a bit tricky getting a pattern that works.
I figure there must be some stuff that is not getting escaped correctly, or escaped too much. Also, this strategy does not seem to get both of the original entry numbers after each other.
Is there a way this can be done? I guess one could just use the entire two blocks as the pattern and the merged result as the replacement text, but that seems even bulkier and more difficult...
Upvotes: 0
Views: 47
Reputation: 21663
The parser for doing this using pyparsing is uncomplicated. Here, it is bound to the name p. Each line is defined as everything up to an end-of-line, followed by the end-of-line itself, and the entire file consists of OneOrMore of these. Since pyparsing ignores whitespace by default, the empty lines disappear.
>>> import pyparsing as pp
>>> theFile = open('temp.txt').read()
>>> p = pp.OneOrMore(pp.Combine(pp.restOfLine+pp.Suppress('\n')))
>>> for item in p.parseString(theFile):
... item
...
'- family:'
'- home: house'
'location: 53rd street|Austin|Texas|U.S'
'type: old'
"original entry: '544'"
'issues:'
'- plumbing: fixed'
'ref:'
'- id: 28'
'cost: 23 USD'
'- family:'
'- home: house'
'location: 53rd street|Austin|Texas|U.S'
'type: old'
"original entry: '545'"
'issues:'
'- plumbing: fixed'
'ref:'
'- id: 1081'
'cost: 33 USD'
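To get from lines like these to the merged output asked for in the question, one possible follow-up is sketched below. It is plain Python rather than anything provided by pyparsing, and merge_family_blocks is a hypothetical helper name. It assumes that every block starts with a '- family:' line and contains a 'ref:' line, and that two blocks belong together when everything up to 'ref:' is identical apart from the 'original entry' line. It reads the raw text instead of the parser output so that the file's original indentation is kept.

```python
def merge_family_blocks(text):
    """Merge '- family:' blocks that differ only in 'original entry' and refs.

    Assumption: each block contains a 'ref:' line; a block without one
    raises ValueError from list.index.
    """
    # Split the text into blocks, one per '- family:' line.
    blocks = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # drop blank lines
        if raw.strip() == '- family:':
            blocks.append([])
        if blocks:  # ignore anything before the first block
            blocks[-1].append(raw)

    # Group blocks by their header, ignoring the 'original entry' line.
    merged = {}  # header key -> lines of the first block plus later refs
    for block in blocks:
        stripped = [line.strip() for line in block]
        cut = stripped.index('ref:') + 1  # first line after 'ref:'
        key = tuple(s for s in stripped[:cut]
                    if not s.startswith('original entry:'))
        if key in merged:
            merged[key].extend(block[cut:])  # append only the ref items
        else:
            merged[key] = list(block)  # keep the first block whole

    # Dicts preserve insertion order, so blocks come out in file order.
    return '\n'.join(line for block in merged.values() for line in block)
```

Calling merge_family_blocks(theFile) on the two entries above should keep the first block, including its original entry '544', and append the '- id: 1081' and 'cost: 33 USD' lines under its 'ref:' section. Any trailing non-block lines, such as a literal '...', would end up attached to the last block, so they would need to be removed first.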
Upvotes: 1