noɥʇʎԀʎzɐɹƆ

Reputation: 10647

How can I split an XML document into strings between a certain tag?

Say I have the following XML:

<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar> 
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
</foo>

With spam, bar, and bacon being data tags with more tags inside, I want to split the XML into strings, one per tag, in order to reorder it for parsing.

The basic structure is like this, with the blocks being in any order:

<foo>
block of bar tags
block of spam tags
block of bacon tags
</foo>

Upvotes: 2

Views: 773

Answers (2)

Daniel

Reputation: 42748

Have you looked at the ElementTree methods?

import xml.etree.ElementTree as ET

document = ET.parse("file.xml")
spams = document.findall("spam")
bars = document.findall("bar")
bacon = document.findall("bacon")
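
To get from those lists to the reordered document the question asks for, one option is to build a new root and append the elements block by block. This is only a minimal sketch: it assumes the XML is held in a string rather than in file.xml, and the function name reorder_blocks and the block order (bar, spam, bacon) are chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

RAW_XML = """<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar>
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
</foo>"""


def reorder_blocks(xml_text, tag_order):
    """Return a new root whose children are grouped by tag, in tag_order."""
    root = ET.fromstring(xml_text)
    new_root = ET.Element(root.tag, root.attrib)
    for tag in tag_order:
        # findall("tag") only matches direct children of root.
        for child in root.findall(tag):
            child.tail = "\n"  # normalise the whitespace after each element
            new_root.append(child)
    return new_root


reordered = reorder_blocks(RAW_XML, ["bar", "spam", "bacon"])
print(ET.tostring(reordered, encoding="unicode"))
```

Children whose tag is not listed in tag_order are left out of the result, so the list would need to cover every tag you want to keep.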

Upvotes: 0

Michael0x2a

Reputation: 64038

If you don't know the tag names ahead of time and just want to break the elements up by group, you could try itertools.groupby in combination with whatever XML parsing library you prefer. Note that groupby only groups consecutive elements, so a tag that reappears later in the document starts a new group:

import xml.etree.ElementTree as et
import itertools

raw_xml = '''<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar> 
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
<spam taste="Great">stuff2</spam>
</foo>'''

groups = itertools.groupby(et.fromstring(raw_xml), lambda element: element.tag)
groups = [list(group[1]) for group in groups]

print(groups)

The output would then be:

[[<Element 'spam' at 0x218ecb0>, <Element 'spam' at 0x218ee10>], 
 [<Element 'bar' at 0x218ee90>, <Element 'bar' at 0x218eeb0>], 
 [<Element 'bacon' at 0x218ef30>, <Element 'bacon' at 0x218ef50>, <Element 'bacon' at 0x218ef90>], 
 [<Element 'spam' at 0x218efd0>]]

If you need the actual string value, you can do:

print([[et.tostring(element, encoding="unicode") for element in group] for group in groups])

...which will get you:

[['<spam taste="great"> stuff</spam> ', '<spam taste="moldy"> stuff</spam>\n'],
 ['<bar taste="eww"> stuff </bar> ', '<bar> stuff </bar> \n'],
 ['<bacon taste="yum"> stuff </bacon>', '<bacon taste="yum"> stuff </bacon>', '<bacon taste="yum"> stuff </bacon>\n'],
 ['<spam taste="Great">stuff2</spam>\n']]
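
In the output above the trailing spam element ends up in a group of its own, because groupby only merges neighbouring items. If you would rather have one group per tag regardless of position, one possible approach is to sort the children by tag before grouping. A rough sketch, assuming Python 3.7+ (for ordered dicts) and using a shortened copy of the sample XML; the helper tag_of and the name merged are just illustrative:

```python
import itertools
import xml.etree.ElementTree as ET

raw_xml = """<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar>
<bacon taste="yum"> stuff </bacon>
<spam taste="Great">stuff2</spam>
</foo>"""


def tag_of(element):
    return element.tag


# sorted() is stable, so elements sharing a tag keep their document order.
children = sorted(ET.fromstring(raw_xml), key=tag_of)

# After sorting, equal tags are adjacent, so groupby yields one group per tag.
merged = {tag: list(group) for tag, group in itertools.groupby(children, key=tag_of)}

for tag, elements in merged.items():
    print(tag, [element.get("taste") for element in elements])
```

The groups come out in alphabetical tag order here; if the original first-seen order matters, collecting the elements into a dict of lists in a plain loop would be an alternative.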

Upvotes: 1
