Is there a way to create XML element tree?

Question

I am currently writing some XSD and DTD files to validate a few XML files. I am writing them by hand because I've had a really bad experience with XSD generators (for example Oxygen).

However, I already have a sample XML file for which I need to do this, and it is really huge. For example, one element has 4312 children.

Because of that experience with XSD generators, I would like to create a kind of XML tree that contains only unique tags and attributes, so I don't have to deal with repeating elements when reading the XML to write an XSD.

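What I mean by that is that I have, for example, this XML (courtesy of W3):

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="...">
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>
    Two of our famous Belgian Waffles with plenty of real maple syrup
    </description>
    <calories>650</calories>
</food>
<food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>
    Light Belgian waffles covered with strawberries and whipped cream
    </description>
    <calories>900</calories>
</food>
<food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>$8.95</price>
    <description>
    Belgian waffles covered with assorted fresh berries and whipped cream
    </description>
    <calories>900</calories>
</food>
<food>
    <name>French Toast</name>
    <price>$4.50</price>
    <description>
    Thick slices made from our homemade sourdough bread
    </description>
    <calories>600</calories>
    <some_complex_type_element_1>
      <some_simple_type_element_1>Text.</some_simple_type_element_1>
    </some_complex_type_element_1>
</food>
<food>
    <name>Homestyle Breakfast</name>
    <price>$6.95</price>
    <description>
    Two eggs, bacon or sausage, toast, and our ever-popular hash browns
    </description>
    <calories>950</calories>
    <some_simple_type_element_2>Text.</some_simple_type_element_2>
</food>
</breakfast_menu>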

As you can see there are 4 types of unique elements under the root element.

These are:

Element 1 (Has attribute),
Element 2 and 3,
Element 4 (Has another complexType element),
Element 5 (Has another simpleType element).

What I would like to achieve is some tree representation of this XML but containing only unique elements and without text.

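So from my example (I do not care about the information inside the tags) there are 4 different unique elements in the root. I would like to get either another XML representation or some ASCII representation of the structure of the document. The first option would be something like:

<breakfast_menu>
<food some_attribute="">
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
    <some_complex_type_element_1>
      <some_simple_type_element_1></some_simple_type_element_1>
    </some_complex_type_element_1>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
    <some_simple_type_element_2></some_simple_type_element_2>
</food>
</breakfast_menu>

Notice there are only tags, no actual values, and only unique tags. I would also like to keep the attributes, but for now I don't care about their values, only that they exist.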
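A reduced XML of this kind could plausibly be produced with Python's standard-library `xml.etree.ElementTree`. The following is only a minimal sketch under some assumptions: the helper name `skeleton` and the shortened sample document are hypothetical, two siblings count as duplicates when their stripped subtrees serialize identically, and the whole file is assumed to fit in memory.

```python
import xml.etree.ElementTree as ET


def skeleton(elem):
    """Return a copy of elem with text dropped, attribute values blanked,
    and structurally identical sibling subtrees collapsed into one."""
    copy = ET.Element(elem.tag, {name: "" for name in sorted(elem.attrib)})
    seen = set()
    for child in elem:
        child_copy = skeleton(child)
        # The serialized skeleton acts as a structural fingerprint.
        key = ET.tostring(child_copy, encoding="unicode")
        if key not in seen:
            seen.add(key)
            copy.append(child_copy)
    return copy


# Shortened, hypothetical sample in the spirit of the breakfast menu.
sample = """
<breakfast_menu>
  <food some_attribute="x"><name>A</name><price>1</price></food>
  <food><name>B</name><price>2</price></food>
  <food><name>C</name><price>3</price></food>
  <food><name>D</name><extra><inner>Text.</inner></extra></food>
</breakfast_menu>
"""

root = skeleton(ET.fromstring(sample))
if hasattr(ET, "indent"):  # ET.indent is available from Python 3.9
    ET.indent(root)
print(ET.tostring(root, encoding="unicode"))
```

Because the fingerprint is the serialized subtree, child order matters: two `food` elements with the same children in a different order are kept as separate shapes.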

The second option would be some ASCII, so for example something like:

breakfast_menu
├── food some_attribute
│   ├── name
│   ├── price
│   ├── description
│   └── calories
├── food
│   ├── name
│   ├── price
│   ├── description
│   └── calories
├── food
│   ├── name
│   ├── price
│   ├── description
│   ├── calories
│   └── some_complex_type_element_1
│       └── some_simple_type_element_1
└── food
    ├── name
    ├── price
    ├── description
    ├── calories
    └── some_simple_type_element_2
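For illustration, a listing in this layout could be drawn from a parsed document with a short recursive function. This is a sketch only: `render_tree` is a hypothetical helper, the sample document is shortened, and the function draws every element it is given, so duplicates would have to be removed beforehand.

```python
import xml.etree.ElementTree as ET


def render_tree(elem, prefix="", is_last=True, is_root=True):
    """Return the lines of an ASCII tree showing tags and attribute names only."""
    label = " ".join([elem.tag, *sorted(elem.attrib)])
    if is_root:
        lines = [label]
        child_prefix = ""
    else:
        lines = [prefix + ("└── " if is_last else "├── ") + label]
        # Children of a last sibling get blank padding, others a vertical bar.
        child_prefix = prefix + ("    " if is_last else "│   ")
    children = list(elem)
    for index, child in enumerate(children):
        last_child = index == len(children) - 1
        lines.extend(render_tree(child, child_prefix, last_child, False))
    return lines


# Shortened, hypothetical sample document.
sample = (
    "<breakfast_menu>"
    '<food some_attribute="x"><name>A</name><price>1</price></food>'
    "<food><name>B</name></food>"
    "</breakfast_menu>"
)
tree_lines = render_tree(ET.fromstring(sample))
print("\n".join(tree_lines))
```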

Do you know of any software, whether online or desktop, that can generate something like this (ideally on a Mac)?

Or is this possible with Python and ElementTree?

I just need to generate something like this, and I am looking for the simplest solution. If you have a better idea (maybe there is a better approach to this), I am open to any suggestion, so please let me know.

Thank you

Edit

Using Power Query you can generate an "okay" representation of your XML. From my testing, it kind of works.

You can generate an XML structure like the one below, however, it is not the greatest solution and it is also not ideal for attributes.

You can reproduce this result by using similar steps:

It is, however, not the cleanest solution, so I am still looking for ideas. Thanks!

dabingsou · Accepted Answer

See if this meets your needs.

from simplified_scrapy import SimplifiedDoc, utils

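xml = '''
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food some_attribute="...">
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>
        Two of our famous Belgian Waffles with plenty of real maple syrup
        </description>
        <calories>650</calories>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>
        Light Belgian waffles covered with strawberries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>$8.95</price>
        <description>
        Belgian waffles covered with assorted fresh berries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>French Toast</name>
        <price>$4.50</price>
        <description>
        Thick slices made from our homemade sourdough bread
        </description>
        <calories>600</calories>
        <some_complex_type_element_1>
        <some_simple_type_element_1>Text.</some_simple_type_element_1>
        </some_complex_type_element_1>
    </food>
    <food>
        <name>Homestyle Breakfast</name>
        <price>$6.95</price>
        <description>
        Two eggs, bacon or sausage, toast, and our ever-popular hash browns
        </description>
        <calories>950</calories>
        <some_simple_type_element_2>Text.</some_simple_type_element_2>
    </food>
</breakfast_menu>
'''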

def loop(node):
    # Collect every key of the node except its own 'tag' and 'html' fields;
    # the remaining keys are treated as attributes and given an empty value.
    para = {}
    for k in node:
        if k in ('tag', 'html'):
            continue
        para[k] = ''
    if para:
        node.setAttrs(para)  # Remove attributes
    children = node.children
    if children:
        # Recurse into child elements
        for c in children:
            loop(c)
    else:
        if node.text:
            node.setContent('')  # Remove value

doc = SimplifiedDoc(xml)
# Remove values and attributes
loop(doc.breakfast_menu)

# Keep only the first occurrence of each distinct child structure
dicNode = {}
for node in doc.breakfast_menu.children:
    key = node.outerHtml
    if dicNode.get(key):
        node.remove()  # Delete duplicate
    else:
        dicNode[key] = True

print(doc.html)

Result:

For large files, try the following method.

from simplified_scrapy import SimplifiedDoc, utils
from simplified_scrapy.core.regex_helper import replaceReg

filePath = 'test.xml'
doc = SimplifiedDoc()
doc.loadFile(filePath, lineByline=True)

# Write the opening root tag (assumed: breakfast_menu, as in the sample)
utils.appendFile('dest.xml', '<breakfast_menu>')
dicNode = {}
for node in doc.getIterable('food'):
    key = node.outerHtml
    key = replaceReg(key, '>[^>]*?<', '><')  # Drop the text between tags
    key = replaceReg(key, '"[^"]*?"', '""')  # Blank out attribute values

    # Write each distinct structure only once
    if not dicNode.get(key):
        dicNode[key] = True
        utils.appendFile('dest.xml', key)

# Write the closing root tag
utils.appendFile('dest.xml', '</breakfast_menu>')
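If installing a third-party package is not an option, a similar streaming pass can probably be written with the standard library's `xml.etree.ElementTree.iterparse`. The sketch below is not the method used above. `signature` and `unique_shapes` are hypothetical helper names, the in-memory `demo` document stands in for a real file path, and it assumes that `food` elements are never nested inside each other.

```python
import io
import xml.etree.ElementTree as ET


def signature(elem):
    """Nested tuple of (tag, sorted attribute names, child signatures); text is ignored."""
    return (elem.tag, tuple(sorted(elem.attrib)), tuple(signature(c) for c in elem))


def unique_shapes(source, record_tag):
    """Stream through source and return the distinct signatures of
    record_tag elements, in first-seen order."""
    shapes = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            shapes.setdefault(signature(elem), None)
            elem.clear()  # Drop the processed subtree to limit memory use
    return list(shapes)


# Small in-memory stand-in for a large file; a file path also works as source.
demo = io.BytesIO(
    b"<breakfast_menu>"
    b'<food some_attribute="x"><name>A</name></food>'
    b"<food><name>B</name></food>"
    b"<food><name>C</name></food>"
    b"</breakfast_menu>"
)
shapes = unique_shapes(demo, "food")
for shape in shapes:
    print(shape)
```

Each signature is a plain nested tuple, so it can be turned back into XML or an ASCII listing in a separate step once the distinct shapes are known.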
