Reputation: 21443
I have text that contains several xml blocks with metadata above it, like this:
Block 1
2017-02-01 12:00
<?xml version="1.0" encoding="UTF-8"?>
<block>
  <elt>text</elt>
  <elt>more text</elt>
  <block>
    <elt>words</elt>
  </block>
</block>

Block 2
2017-02-01 12:15
<?xml version="1.0" encoding="UTF-8"?>
<block>
  <block>
    <elt>text</elt>
    <block>
      <elt>words</elt>
    </block>
    <elt>more text</elt>
  </block>
  <elt>word</elt>
</block>
I need to pull out the xml text and skip over the metadata. I can do it iteratively like this:
messages = []
while True:
    start = xml.find('<?xml')
    if start == -1:
        break
    xml = xml[start:]
    end = xml.find('\n\n')
    if end == -1:
        messages.append(xml)
        break
    else:
        messages.append(xml[:end])
        xml = xml[end:]
But I'd like to use a regular expression instead. The problem is that I need to match either two consecutive line breaks (\n\n) or the end of the string (\Z), and that is where I'm stuck. I've tried this:
re.findall('<\?xml.*?[\n\n|\Z]', xml, re.DOTALL)
but I just get ['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n'].
I've used \b in the past to match words, but that makes no difference:
>>> re.findall('<\?xml.*?[(\b\n\n\b)|\Z]', xml, re.DOTALL)
['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n']
I can't figure out how to make it work.
Upvotes: 0
Views: 366
Reputation: 140168
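You're trying to match the end of the string OR two newlines inside a character class []. That doesn't work: a character class matches exactly one character from the set, and | and parentheses have no special meaning inside it, so the lazy .*? stops at the first single newline.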
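To illustrate the difference, here is a minimal sketch comparing a character class with a real alternation. The test string "ab|c" and the variable names are made up for the demonstration and are not from the question.

```python
import re

# Inside [...], "|" is just a literal pipe, and every match is one character long.
class_matches = re.findall(r"[ab|c]", "ab|c")

# Outside a class, "|" means alternation: match "ab" or match "c".
group_matches = re.findall(r"(?:ab|c)", "ab|c")

print(class_matches)  # ['a', 'b', '|', 'c']
print(group_matches)  # ['ab', 'c']
```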
I'd match them in a lookahead instead. A lookahead consumes no text and, unlike ordinary grouping parentheses, creates no capture group, so findall returns the whole match:
re.findall(r'<\?xml.*?(?=\n\n|\Z)', xml, re.DOTALL)
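A self-contained sketch of the lookahead approach follows. The sample string is a shortened, hypothetical version of the data in the question, and it assumes records are separated by exactly one blank line. A raw string is used so the backslashes reach the regex engine unchanged.

```python
import re

# Hypothetical sample: metadata lines, then an XML document, with a blank
# line between records (shortened from the data in the question).
sample = (
    "Block 1\n"
    "2017-02-01 12:00\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    "  <elt>text</elt>\n"
    "</block>\n"
    "\n"
    "Block 2\n"
    "2017-02-01 12:15\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    "  <elt>word</elt>\n"
    "</block>"
)

# Lazily match from "<?xml" up to (but not including) a blank line
# or the end of the string.
pattern = re.compile(r"<\?xml.*?(?=\n\n|\Z)", re.DOTALL)

messages = pattern.findall(sample)
for message in messages:
    print(message)
    print("---")
```

Because the lookahead does not consume the blank line, the metadata of the next record is never part of a match. If the input ends with a single trailing newline, that newline will be included in the last match, so a final strip() may be needed.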
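Another workaround is to match up to the last </block>, the one that starts on a new line. This relies on the nested elements being indented, so that only the outermost closing tag sits at the beginning of a line: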
re.findall(r'<\?xml.*?\n</block>', xml, re.DOTALL)
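The sketch below tries this second pattern on two made-up inputs: one where nested blocks are indented, and one where they are not. Both sample strings and all variable names are assumptions for illustration.

```python
import re

closing_tag = re.compile(r"<\?xml.*?\n</block>", re.DOTALL)

# Hypothetical input with indented nested blocks.
indented = (
    "Block 1\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    "  <block>\n"
    "    <elt>words</elt>\n"
    "  </block>\n"
    "  <elt>word</elt>\n"
    "</block>\n"
    "\n"
    "Block 2\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    "  <elt>text</elt>\n"
    "</block>"
)
indented_matches = closing_tag.findall(indented)

# Same idea without indentation: the lazy match stops at the inner </block>.
flat = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    "<block>\n"
    "<elt>words</elt>\n"
    "</block>\n"
    "</block>"
)
flat_matches = closing_tag.findall(flat)

print(len(indented_matches))      # 2 complete documents
print(flat_matches[0] == flat)    # False: the outer </block> was cut off
```

If the indentation of the real data cannot be relied on, the lookahead version is the safer choice, since it depends only on the blank line between records.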
Upvotes: 1