Reputation: 423
I am currently writing some XSD and DTD files to validate a few XML files. I am writing them by hand because I've had a really bad experience with XSD generators (for example, Oxygen).
However, the sample XML I need to do this for is really huge; for example, one element has 4312 children.
Because of that, I would like to create a kind of XML tree that contains only unique tags and attributes, so I don't have to deal with repeating elements when reading the XML to write an XSD.
For example, take this XML (courtesy of W3):
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="1.0">
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>
Two of our famous Belgian Waffles with plenty of real maple syrup
</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>
Light Belgian waffles covered with strawberries and whipped cream
</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>
Belgian waffles covered with assorted fresh berries and whipped cream
</description>
<calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>
Thick slices made from our homemade sourdough bread
</description>
<calories>600</calories>
<some_complex_type_element_1>
<some_simple_type_element_1>Text.</some_simple_type_element_1>
</some_complex_type_element_1>
</food>
<food>
<name>Homestyle Breakfast</name>
<price>$6.95</price>
<description>
Two eggs, bacon or sausage, toast, and our ever-popular hash browns
</description>
<calories>950</calories>
<some_simple_type_element_2>Text.</some_simple_type_element_2>
</food>
</breakfast_menu>
As you can see, there are 4 unique variants of the food element under the root element: one with some_attribute, one with only the four basic children, one with an additional some_complex_type_element_1, and one with an additional some_simple_type_element_2.
What I would like to achieve is a tree representation of this XML that contains only the unique elements and no text.
I do not care about the content inside the tags, so from my example I would like to get either another XML representation or an ASCII representation of the document structure, for example something like:
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="">
<name></name>
<price></price>
<description></description>
<calories></calories>
</food>
<food>
<name></name>
<price></price>
<description></description>
<calories></calories>
</food>
<food>
<name></name>
<price></price>
<description></description>
<calories></calories>
<some_complex_type_element_1>
<some_simple_type_element_1></some_simple_type_element_1>
</some_complex_type_element_1>
</food>
<food>
<name></name>
<price></price>
<description></description>
<calories></calories>
<some_simple_type_element_2></some_simple_type_element_2>
</food>
</breakfast_menu>
Notice that there are only tags, no actual values, and only unique tags. I would also like to keep the attributes, but for now I only care that they exist, not what their values are.
The second option would be an ASCII tree, for example something like:
breakfast_menu
├── food some_attribute
│   ├── name
│   ├── price
│   ├── description
│   └── calories
├── food
│   ├── name
│   ├── price
│   ├── description
│   └── calories
├── food
│   ├── name
│   ├── price
│   ├── description
│   ├── calories
│   └── some_complex_type_element_1
│       └── some_simple_type_element_1
└── food
    ├── name
    ├── price
    ├── description
    ├── calories
    └── some_simple_type_element_2
Do you know of any software, whether it's online or desktop, that can generate something like this (ideally on Mac)?
Or is this possible with Python and ElementTree?
I am looking for the simplest way to generate something like this. If you have a better idea or a different approach, I am open to any suggestion, so please let me know.
Thank you
Edit
Using Power Query you can generate an "okay" representation of your XML; from my testing, it kind of works.
You can generate an XML structure like the one below. However, it is not a great solution, and it does not handle attributes well.
You can reproduce this result by using similar steps:
It is not the cleanest solution, though, so I am still looking for ideas. Thanks!
Upvotes: 1
Views: 109
Reputation: 2469
See if this meets your needs.
from simplified_scrapy import SimplifiedDoc, utils
xml = '''
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="1.0">
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>
Two of our famous Belgian Waffles with plenty of real maple syrup
</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>
Light Belgian waffles covered with strawberries and whipped cream
</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>
Belgian waffles covered with assorted fresh berries and whipped cream
</description>
<calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>
Thick slices made from our homemade sourdough bread
</description>
<calories>600</calories>
<some_complex_type_element_1>
<some_simple_type_element_1>Text.</some_simple_type_element_1>
</some_complex_type_element_1>
</food>
<food>
<name>Homestyle Breakfast</name>
<price>$6.95</price>
<description>
Two eggs, bacon or sausage, toast, and our ever-popular hash browns
</description>
<calories>950</calories>
<some_simple_type_element_2>Text.</some_simple_type_element_2>
</food>
</breakfast_menu>
'''
def loop(node):
    para = {}
    for k in node:
        if k == 'tag' or k == 'html':
            continue
        para[k] = ''
    if para:
        node.setAttrs(para)  # Blank out attribute values, keep the names
    children = node.children
    if children:
        for c in children:
            loop(c)
    else:
        if node.text:
            node.setContent('')  # Remove value

doc = SimplifiedDoc(xml)
# Remove text values and blank out attribute values
loop(doc.breakfast_menu)
dicNode = {}
for node in doc.breakfast_menu.children:
    key = node.outerHtml
    if dicNode.get(key):
        node.remove()  # Delete duplicate
    else:
        dicNode[key] = True
print(doc.html)
Result:
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="">
<name></name>
<price></price>
<description></description>
<calories></calories>
</food>
<food>
<name></name>
<price></price>
<description></description>
<calories></calories>
</food>
<food>
<name></name>
<price></price>
<description></description>
<calories></calories>
<some_complex_type_element_1>
<some_simple_type_element_1></some_simple_type_element_1>
</some_complex_type_element_1>
</food>
<food>
<name></name>
<price></price>
<description></description>
<calories></calories>
<some_simple_type_element_2></some_simple_type_element_2>
</food>
</breakfast_menu>
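If you would rather stay with the standard library, as the question asks about ElementTree, here is a minimal sketch of the same idea. It is not the simplified_scrapy approach above: the helper names `skeleton` and `ascii_tree` and the shortened sample are my own, and it assumes the file fits in memory. Two elements count as duplicates only if their tags, attribute names, and children (in order) all match.

```python
import xml.etree.ElementTree as ET


def skeleton(elem):
    """Copy elem without text, with blank attribute values, and with
    structurally identical sibling elements collapsed into one."""
    copy = ET.Element(elem.tag, {name: "" for name in sorted(elem.attrib)})
    seen = set()
    for child in elem:
        child_copy = skeleton(child)
        # The serialized skeleton serves as a structural signature.
        signature = ET.tostring(child_copy, encoding="unicode")
        if signature not in seen:
            seen.add(signature)
            copy.append(child_copy)
    return copy


def ascii_tree(elem, prefix=""):
    """Yield box-drawing lines for the descendants of elem."""
    children = list(elem)
    for i, child in enumerate(children):
        last = i == len(children) - 1
        label = " ".join([child.tag, *child.attrib])
        yield prefix + ("└── " if last else "├── ") + label
        yield from ascii_tree(child, prefix + ("    " if last else "│   "))


# Shortened, hypothetical sample in the spirit of the question's XML.
sample = """<breakfast_menu>
  <food some_attribute="1.0"><name>Belgian Waffles</name><price>$5.95</price></food>
  <food><name>French Toast</name><price>$4.50</price></food>
  <food><name>Homestyle Breakfast</name><price>$6.95</price></food>
  <food><name>Toast</name><extra><inner>Text.</inner></extra></food>
</breakfast_menu>"""

root = skeleton(ET.fromstring(sample))
print(ET.tostring(root, encoding="unicode"))  # XML skeleton
print(root.tag)                               # ASCII tree, root first
print("\n".join(ascii_tree(root)))
```

For a real file, replace `ET.fromstring(sample)` with `ET.parse('test.xml').getroot()`. Note that ElementTree writes empty elements in the short form (`<name />`), which is equivalent to `<name></name>`.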
For large files, try the following method.
from simplified_scrapy import SimplifiedDoc, utils
from simplified_scrapy.core.regex_helper import replaceReg
filePath = 'test.xml'
doc = SimplifiedDoc()
doc.loadFile(filePath, lineByline=True)
utils.appendFile('dest.xml','<?xml version="1.0" encoding="UTF-8"?><breakfast_menu>')
dicNode = {}
for node in doc.getIterable('food'):
    key = node.outerHtml
    key = replaceReg(key, '>[^>]*?<', '><')  # Strip the text between tags
    key = replaceReg(key, '"[^"]*?"', '""')  # Blank out attribute values
    if not dicNode.get(key):
        dicNode[key] = True
        utils.appendFile('dest.xml', key)  # Write each unique structure once
utils.appendFile('dest.xml', '</breakfast_menu>')
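A standard-library alternative for large files is `xml.etree.ElementTree.iterparse`, which lets you inspect each element as soon as its closing tag has been read. The sketch below is only an illustration under a few assumptions: the names `signature` and `unique_shapes` are mine, the repeated element (`food` here) is not nested inside another element with the same tag, and child order matters when comparing structures.

```python
import io
import xml.etree.ElementTree as ET


def signature(elem):
    """Nested tuple of tag, sorted attribute names, and child signatures."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib)),
        tuple(signature(child) for child in elem),
    )


def unique_shapes(source, tag):
    """Stream source and return the distinct structures of <tag> elements
    in first-seen order. source is a file name or a binary file object."""
    shapes = {}  # dict used as an ordered set
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            shapes.setdefault(signature(elem), None)
            # Release the subtree; an empty shell stays in the parent.
            elem.clear()
    return list(shapes)


# Shortened, hypothetical sample; pass a file name such as 'test.xml' instead.
sample = b"""<breakfast_menu>
<food some_attribute="1.0"><name>A</name><price>1</price></food>
<food><name>B</name><price>2</price></food>
<food><name>C</name><price>3</price></food>
</breakfast_menu>"""

for shape in unique_shapes(io.BytesIO(sample), "food"):
    print(shape)
```

Each printed tuple describes one unique structure, so the number of tuples tells you how many variants the XSD has to cover.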
Upvotes: 2