Reputation: 10647
Say I have the following XML:
<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar>
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
</foo>
Here spam, bar, and bacon are data tags with more tags nested inside them. I want to split the XML into these blocks:
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
,<bar taste="eww"> stuff </bar> <bar> stuff </bar>
,<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
so that I can reorder the blocks for parsing.
The basic structure looks like this, with the blocks appearing in any order:
<foo>
block of bar tags
block of spam tags
block of bacon tags
</foo>
Upvotes: 2
Views: 773
Reputation: 42748
Have you looked at the ElementTree methods?
import xml.etree.ElementTree as ET
document = ET.parse("file.xml")
spams = document.findall("spam")
bars = document.findall("bar")
bacon = document.findall("bacon")
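To go from those findall results to the reordered document the question asks for, something like the following might work. It is only a sketch: it parses the sample XML from a string instead of file.xml, and the wanted_order list is a made-up name holding the order shown in the question's outline.

```python
import xml.etree.ElementTree as ET

raw_xml = """<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar>
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
</foo>"""

root = ET.fromstring(raw_xml)

# Hypothetical target order, taken from the outline in the question.
wanted_order = ["bar", "spam", "bacon"]

# Build a new root with the same tag and attributes, then append each block in turn.
reordered = ET.Element(root.tag, root.attrib)
for tag in wanted_order:
    # findall with a bare tag name only matches direct children of root.
    reordered.extend(root.findall(tag))

tags = [child.tag for child in reordered]
print(tags)
```

ET.tostring(reordered) should then give the reordered XML as a string if that is what the next parsing step needs.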
Upvotes: 0
Reputation: 64038
If you don't know the tag names ahead of time and just want to break up the elements by group, you can try using itertools.groupby in combination with whatever XML parsing library you prefer:
import xml.etree.ElementTree as et
import itertools
raw_xml = '''<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar>
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
<spam taste="Great">stuff2</spam>
</foo>'''
groups = itertools.groupby(et.fromstring(raw_xml), lambda element: element.tag)
groups = [list(group[1]) for group in groups]
print(groups)
The output would then be:
[[<Element 'spam' at 0x218ecb0>, <Element 'spam' at 0x218ee10>],
[<Element 'bar' at 0x218ee90>, <Element 'bar' at 0x218eeb0>],
[<Element 'bacon' at 0x218ef30>, <Element 'bacon' at 0x218ef50>, <Element 'bacon' at 0x218ef90>],
[<Element 'spam' at 0x218efd0>]]
If you need the actual string value, you can do:
print([[et.tostring(element, encoding="unicode") for element in group] for group in groups])
...which will get you:
[['<spam taste="great"> stuff</spam> ', '<spam taste="moldy"> stuff</spam>\n'],
['<bar taste="eww"> stuff </bar> ', '<bar> stuff </bar> \n'],
['<bacon taste="yum"> stuff </bacon>', '<bacon taste="yum"> stuff </bacon>', '<bacon taste="yum"> stuff </bacon>\n'],
['<spam taste="Great">stuff2</spam>\n']]
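Note that groupby only merges consecutive elements, which is why the trailing spam ends up in a group of its own. If you would rather collect every element with the same tag no matter where it appears, a plain dict keyed by tag name is one possible alternative. This is a sketch rather than part of the approach above, and by_tag and counts are just illustrative names:

```python
import xml.etree.ElementTree as ET

raw_xml = """<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar>
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
<spam taste="Great">stuff2</spam>
</foo>"""

# Map each tag name to the list of child elements carrying that tag.
by_tag = {}
for element in ET.fromstring(raw_xml):
    by_tag.setdefault(element.tag, []).append(element)

# Count the elements per tag to show that the late spam joined the first two.
counts = {tag: len(elements) for tag, elements in by_tag.items()}
print(counts)
```

On Python 3.7 or later the dict keeps the tags in the order they were first seen, so the blocks come out as spam, bar, bacon.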
Upvotes: 1