Reputation: 43
I have an XML file as follows:
<?xml version="1.0"?>
<max:SyncObject xmlns:max="http://www.ibm.com/max">
<max:ObjectSet>
<max:PARENT action="AddChange">
<max:FIELD1>string</max:FIELD1>
<max:FIELD2>string</max:FIELD2>
<max:FIELD3>string</max:FIELD3>
<max:FIELD4>string</max:FIELD4>
<max:FIELD5>string</max:FIELD5>
<max:FIELD6>string</max:FIELD6>
<max:FIELD7>string</max:FIELD7>
<max:CHILD1 action="Ignored">
<max:CH1FIELD1 action="Ignored">
<max:CH1SUB1>string</max:CH1SUB1>
<max:CH1FIELD2>string</max:CH1FIELD2>
</max:CHILD1>
<max:CHILD2 action="Ignored">
<max:CH2FIELD1>string</max:CH2FIELD1>
</max:CHILD2>
</max:PARENT>
</max:ObjectSet>
</max:SyncObject>
The end result I want to achieve is as follows:
{'PARENT': ['FIELD1', 'FIELD2', 'FIELD3', 'FIELD4', 'FIELD5', 'FIELD6', 'FIELD7', 'CHILD1', 'CHILD2']}, {'CHILD1': ['CH1FIELD1', 'CH1FIELD2'], 'CHILD2': ['CH2FIELD1'], 'CH1FIELD1': ['CH1SUB1']}
I have tried several different methods of extracting the FIELD1, FIELD2, ... tags from the XML file while still maintaining the structure. As you can see, the PARENT dictionary is separate from the rest and contains all tags exactly one level below it. The same is true for the child tags. The action attribute is not needed, as it will be specified by other means within the class.
It seems that most lxml and ElementTree examples are geared toward extracting the attributes from XML tags rather than the tag names themselves.
Could anyone point me in the right direction for extracting the tag names (the field names) without the prefix, value, or any attributes, while preserving the structure?
THANKS!
Upvotes: 2
Views: 137
Reputation: 473853
First of all, your XML data is not well-formed: the closing </max:CH1FIELD1> tag is missing.
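If you want to confirm this yourself, a quick check with the standard library is enough. This is only a sketch on a cut-down, hypothetical fragment (no namespace prefix, just the unclosed element), and the variable names broken and well_formed are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down fragment: <CH1FIELD1> is opened but never closed.
broken = (
    '<CHILD1>'
    '<CH1FIELD1>'
    '<CH1SUB1>string</CH1SUB1>'
    '</CHILD1>'
)

try:
    ET.fromstring(broken)
    well_formed = True
except ET.ParseError as exc:
    # The parser refuses the document and reports where the tags stop matching.
    well_formed = False
    print('not well-formed:', exc)
```

Any XML parser should reject the original file in the same way, so the missing tag has to be fixed before anything else can work.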
To convert it to a Python data structure, use xmltodict:
import xmltodict
data = """<?xml version="1.0"?>
<max:SyncObject xmlns:max="http://www.ibm.com/max">
<max:ObjectSet>
<max:PARENT action="AddChange">
<max:FIELD1>string</max:FIELD1>
<max:FIELD2>string</max:FIELD2>
<max:FIELD3>string</max:FIELD3>
<max:FIELD4>string</max:FIELD4>
<max:FIELD5>string</max:FIELD5>
<max:FIELD6>string</max:FIELD6>
<max:FIELD7>string</max:FIELD7>
<max:CHILD1 action="Ignored">
<max:CH1FIELD1 action="Ignored">
<max:CH1SUB1>string</max:CH1SUB1>
</max:CH1FIELD1>
<max:CH1FIELD2>string</max:CH1FIELD2>
</max:CHILD1>
<max:CHILD2 action="Ignored">
<max:CH2FIELD1>string</max:CH2FIELD1>
</max:CHILD2>
</max:PARENT>
</max:ObjectSet>
</max:SyncObject>"""
# Mapping the namespace URI to None strips the "max:" prefix from the keys.
d = xmltodict.parse(data,
                    process_namespaces=True,
                    namespaces={'http://www.ibm.com/max': None})
print(d)
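The parsed dictionary still carries the element values and attributes. If only the tag names are needed, a minimal sketch with the standard library's ElementTree could look like the following. The helper names local_name and tag_structure are hypothetical, the sample is shortened to two fields, and it assumes the closing </max:CH1FIELD1> belongs right after CH1SUB1, as the desired output in the question suggests:

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    # ElementTree reports namespaced tags as '{uri}NAME'; keep only NAME.
    return tag.rsplit('}', 1)[-1]

def tag_structure(start):
    """Map every element that has children to the list of its child tag names."""
    structure = {}
    for elem in start.iter():  # document order, includes `start` itself
        children = [local_name(child.tag) for child in elem]
        if children:
            structure[local_name(elem.tag)] = children
    return structure

# Shortened sample, with the missing closing tag added after CH1SUB1 (assumption).
xml_data = """<?xml version="1.0"?>
<max:SyncObject xmlns:max="http://www.ibm.com/max">
<max:ObjectSet>
<max:PARENT action="AddChange">
<max:FIELD1>string</max:FIELD1>
<max:FIELD2>string</max:FIELD2>
<max:CHILD1 action="Ignored">
<max:CH1FIELD1 action="Ignored">
<max:CH1SUB1>string</max:CH1SUB1>
</max:CH1FIELD1>
<max:CH1FIELD2>string</max:CH1FIELD2>
</max:CHILD1>
<max:CHILD2 action="Ignored">
<max:CH2FIELD1>string</max:CH2FIELD1>
</max:CHILD2>
</max:PARENT>
</max:ObjectSet>
</max:SyncObject>"""

root = ET.fromstring(xml_data)
ns = {'max': 'http://www.ibm.com/max'}
parent = root.find('.//max:PARENT', ns)

structure = tag_structure(parent)
# Split the PARENT entry off from the child entries, as in the question.
parent_fields = {'PARENT': structure.pop('PARENT')}

print(parent_fields)
print(structure)
```

Note that this keys the result by tag name, so if the same tag name appears more than once in the document, the later occurrence overwrites the earlier one.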
Upvotes: 1