Raymond_90

Reputation: 423

Is there a way to create an XML element tree?

I am currently writing XSD and DTD files to validate a few XML files. I am writing them by hand because I've had really bad experiences with XSD generators (for example, Oxygen).

However, the sample XML I need to do this for is really huge: for example, one element has 4312 children.

Since I want to avoid XSD generators, I would like to create a kind of XML tree that contains only the unique tags and attributes, so I don't have to deal with repeated elements when reading the XML to write an XSD.

For example, take this XML (courtesy of W3):

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="1.0">
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>
   Two of our famous Belgian Waffles with plenty of real maple syrup
   </description>
    <calories>650</calories>
</food>
<food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>
    Light Belgian waffles covered with strawberries and whipped cream
    </description>
    <calories>900</calories>
</food>
<food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>$8.95</price>
    <description>
    Belgian waffles covered with assorted fresh berries and whipped cream
    </description>
    <calories>900</calories>
</food>
<food>
    <name>French Toast</name>
    <price>$4.50</price>
    <description>
    Thick slices made from our homemade sourdough bread
    </description>
    <calories>600</calories>
    <some_complex_type_element_1>
      <some_simple_type_element_1>Text.</some_simple_type_element_1>
    </some_complex_type_element_1>
</food>
<food>
    <name>Homestyle Breakfast</name>
    <price>$6.95</price>
    <description>
    Two eggs, bacon or sausage, toast, and our ever-popular hash browns
    </description>
    <calories>950</calories>
    <some_simple_type_element_2>Text.</some_simple_type_element_2>
</food>
</breakfast_menu>

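As you can see, there are 4 structurally unique variants of the food element under the root element: one with some_attribute, one with only the four basic children, one with an additional some_complex_type_element_1, and one with an additional some_simple_type_element_2.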

What I would like is a tree representation of this XML that contains only the unique elements, without any text.

So, for my example (I do not care about the content inside the tags), I would like to get either another XML representation or an ASCII representation of the document's structure showing only those 4 unique elements, for example something like:

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="">
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
    <some_complex_type_element_1>
      <some_simple_type_element_1></some_simple_type_element_1>
    </some_complex_type_element_1>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
    <some_simple_type_element_2></some_simple_type_element_2>
</food>
</breakfast_menu>

Notice that there are only tags, no actual values, and only unique tags. I would also like to keep attributes, but for now I don't care about their values, only that they exist.

The second option would be an ASCII tree, for example something like:

breakfast_menu
├── food some_attribute
│   ├── name
│   ├── price
│   ├── description
│   └── calories
├── food
│   ├── name
│   ├── price
│   ├── description
│   └── calories
├── food
│   ├── name
│   ├── price
│   ├── description
│   ├── calories
│   └── some_complex_type_element_1
│       └── some_simple_type_element_1
└── food
    ├── name
    ├── price
    ├── description
    ├── calories
    └── some_simple_type_element_2

Do you know of any software, whether it's online or desktop, that can generate something like this (ideally on Mac)?

Or is this possible with Python and ElementTree?

I just need to generate something like this, and I am looking for the simplest solution. If you have a better idea or approach, I am open to any suggestion, so please let me know.

Thank you

Edit

Using Power Query, you can generate an "okay" representation of your XML; in my testing it kind of works.

It can produce an XML structure like the one below. However, it is not the greatest solution, and it is not ideal for attributes.

enter image description here

You can reproduce this result by using steps similar to these:

enter image description here

It is not the cleanest solution, however, so I am still looking for ideas. Thanks!

Upvotes: 1

Views: 109

Answers (1)

dabingsou

Reputation: 2469

See if this meets your needs.

from simplified_scrapy import SimplifiedDoc, utils

xml = '''
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food some_attribute="1.0">
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>
    Two of our famous Belgian Waffles with plenty of real maple syrup
    </description>
        <calories>650</calories>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>
        Light Belgian waffles covered with strawberries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>$8.95</price>
        <description>
        Belgian waffles covered with assorted fresh berries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>French Toast</name>
        <price>$4.50</price>
        <description>
        Thick slices made from our homemade sourdough bread
        </description>
        <calories>600</calories>
        <some_complex_type_element_1>
        <some_simple_type_element_1>Text.</some_simple_type_element_1>
        </some_complex_type_element_1>
    </food>
    <food>
        <name>Homestyle Breakfast</name>
        <price>$6.95</price>
        <description>
        Two eggs, bacon or sausage, toast, and our ever-popular hash browns
        </description>
        <calories>950</calories>
        <some_simple_type_element_2>Text.</some_simple_type_element_2>
    </food>
</breakfast_menu>
'''

def loop(node):
    # Blank every attribute value but keep the attribute itself
    para = {}
    for k in node:
        if k == 'tag' or k == 'html':
            continue
        para[k] = ''
    if para:
        node.setAttrs(para)
    children = node.children
    if children:
        # Recurse into child elements
        for c in children:
            loop(c)
    elif node.text:
        node.setContent('')  # Leaf element: remove its text value

doc = SimplifiedDoc(xml)
# Remove text values and blank attribute values
loop(doc.breakfast_menu)

dicNode = {}
for node in doc.breakfast_menu.children:
    key = node.outerHtml
    if dicNode.get(key):
        node.remove() # Delete duplicate
    else:
        dicNode[key] = True

print(doc.html)

Result:

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food some_attribute="">
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
        <some_complex_type_element_1>
        <some_simple_type_element_1></some_simple_type_element_1>
        </some_complex_type_element_1>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
        <some_simple_type_element_2></some_simple_type_element_2>
    </food>
</breakfast_menu>
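
Since the question also asks about ElementTree, here is a rough sketch of the same idea using only the standard library. It is not part of simplified_scrapy; the names skeleton, ascii_tree and SAMPLE are made up for this illustration, and the sample is a shortened stand-in for the real file. It assumes the whole document fits in memory, and it collapses structurally identical siblings at every level, so how often a child repeats is lost.

```python
import xml.etree.ElementTree as ET

def skeleton(elem):
    """Copy elem without text, with blank attribute values, and with
    structurally identical children collapsed to their first occurrence."""
    # Sort attribute names so attribute order does not affect the comparison
    copy = ET.Element(elem.tag, {name: "" for name in sorted(elem.attrib)})
    seen = set()
    for child in elem:
        child_copy = skeleton(child)
        signature = ET.tostring(child_copy)  # serialized shape of the subtree
        if signature not in seen:
            seen.add(signature)
            copy.append(child_copy)
    return copy

def ascii_tree(elem, prefix=""):
    """Yield one drawing line per descendant of elem."""
    children = list(elem)
    for i, child in enumerate(children):
        last = i == len(children) - 1
        label = " ".join([child.tag, *child.attrib])  # tag plus attribute names
        yield prefix + ("└── " if last else "├── ") + label
        yield from ascii_tree(child, prefix + ("    " if last else "│   "))

# Shortened, hypothetical sample in the spirit of the question's XML
SAMPLE = """<breakfast_menu>
  <food some_attribute="1.0"><name>A</name><price>$1</price></food>
  <food><name>B</name><price>$2</price></food>
  <food><name>C</name><price>$3</price></food>
  <food><name>D</name><price>$4</price><extra><inner>Text.</inner></extra></food>
</breakfast_menu>"""

root = skeleton(ET.fromstring(SAMPLE))
tree_lines = [root.tag, *ascii_tree(root)]
print("\n".join(tree_lines))
```

For this sample the third food is dropped because it has the same shape as the second, and the printed tree is:

breakfast_menu
├── food some_attribute
│   ├── name
│   └── price
├── food
│   ├── name
│   └── price
└── food
    ├── name
    ├── price
    └── extra
        └── inner

If you prefer the XML form, ET.tostring(root, encoding="unicode") serializes the same skeleton.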

For large files, try the following method.

from simplified_scrapy import SimplifiedDoc, utils
from simplified_scrapy.core.regex_helper import replaceReg

filePath = 'test.xml'
doc = SimplifiedDoc()
doc.loadFile(filePath, lineByline=True)

utils.appendFile('dest.xml','<?xml version="1.0" encoding="UTF-8"?><breakfast_menu>')
dicNode = {}
for node in doc.getIterable('food'):
    key = node.outerHtml
    key = replaceReg(key, '>[^>]*?<', '><')  # Strip the text between tags
    key = replaceReg(key, '"[^"]*?"', '""')  # Blank the attribute values

    # Write each distinct structure only once
    if not dicNode.get(key):
        dicNode[key] = True
        utils.appendFile('dest.xml', key)

utils.appendFile('dest.xml', '</breakfast_menu>')
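
A comparable streaming approach is possible with the standard library's ElementTree.iterparse. This is only a sketch under some assumptions: signature, unique_shapes and record_tag are illustrative names, the record elements (food here) are assumed not to be nested inside each other, and the in-memory sample stands in for a real file path, which iterparse also accepts.

```python
import io
import xml.etree.ElementTree as ET

def signature(elem):
    """Structural fingerprint: tag, sorted attribute names, child fingerprints."""
    return (elem.tag,
            tuple(sorted(elem.attrib)),
            tuple(signature(child) for child in elem))

def unique_shapes(source, record_tag):
    """Stream source and collect the distinct fingerprints of record_tag elements."""
    shapes = []
    seen = set()
    # An "end" event fires once an element and all of its children are parsed
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            sig = signature(elem)
            if sig not in seen:
                seen.add(sig)
                shapes.append(sig)
            elem.clear()  # drop the record's children, text and attributes
    return shapes

# Hypothetical stand-in for a large file such as test.xml
SAMPLE = b"""<breakfast_menu>
  <food some_attribute="1.0"><name>A</name></food>
  <food><name>B</name></food>
  <food><name>C</name><extra>Text.</extra></food>
  <food><name>D</name></food>
</breakfast_menu>"""

shapes = unique_shapes(io.BytesIO(SAMPLE), "food")
for shape in shapes:
    print(shape)
```

Unlike the regex approach, this compares parsed structure rather than text, so whitespace and attribute order do not create false differences. Note that elem.clear() only empties each record; the emptied elements stay attached to the root, so memory still grows slowly on very large files.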

Upvotes: 2
