How to check lxml element tree strings?

Question

I have a list of lxml element trees. I would like to store in a dictionary the number of times each subtree appears among the subtrees of the trees in the list. For example:

tree1=''''''
tree2=''''''
tree3=''''''
list_trees=[tree1,tree2,tree3]
print list_trees
from collections import defaultdict
from lxml import etree as ET
mydict=defaultdict(int) 
for tree in list_trees:
    root=ET.fromstring(tree)
    for sub_root in root.iter():
        print ET.tostring(sub_root)
        mydict[ET.tostring(sub_root)]+=1
print mydict

I get the following correct result:

defaultdict(, {'': 1, '': 2, '': 1, '': 2, '': 1, '': 1, '': 1})

This only works in this particular example. In the general case, XML documents can be identical but have a different ordering of attributes, or extra whitespace or newlines that don't matter, and that breaks my approach. I know there have been posts about how to check whether two XML trees are identical. However, I would like to convert the XML into strings for the application described above (keeping unique trees as strings allows for easy comparisons and more flexibility in the future) and also to store them in SQL nicely. How can XML be turned into a string in a consistent manner, regardless of attribute ordering, extra spaces, or extra lines?

Edit, to give a case that does not work: these three XML trees are identical; they differ only in attribute ordering, extra spaces, or newlines.

tree4=''''''
tree5='''
'''
tree6=''''''

My output gives the following:

defaultdict(, {'': 3, '': 1, '
': 1, '': 3, '': 1})

However, the output should be:

defaultdict(, {'': 3, '': 3, '': 3})

supersam654 · Accepted Answer

If you insist on comparing the string representations of XML trees, I recommend using BeautifulSoup on top of lxml. In particular, calling prettify() on any part of the tree produces a consistent representation that ignores whitespace and strange formatting from the input. The output strings are a bit more verbose, but they work. I went ahead and replaced newlines with "fake newlines" ('\n' -> '\\n') so the output is more compact.

from collections import defaultdict
from bs4 import BeautifulSoup as Soup

tree4=''''''
tree5='''
'''
tree6=''''''
list_trees = [tree4, tree5, tree6]

mydict = defaultdict(int)
for tree in list_trees:
    root = Soup(tree, 'lxml-xml') # Use the LXML XML parser.
    for sub_root in root.find_all():
        print(sub_root)
        mydict[sub_root.prettify().replace('\n', '\\n')] += 1

print('Results')
for key, value in mydict.items():
    print(u'%s: %s' % (key, value))

Which prints out the desired results (with a few extra newlines and spaces):

$ python counter.py
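
If attribute ordering also has to be ignored, XML canonicalization (C14N) may be a better fit, since it writes attributes in sorted order. Below is a minimal sketch, not the method used above: it swaps in the standard library's xml.etree.ElementTree.canonicalize() (Python 3.8+) with strip_text=True instead of BeautifulSoup. The helper names canonical_key and count_subtrees and the sample strings are hypothetical, because the original tree4/tree5/tree6 contents are not shown here. lxml itself also accepts etree.tostring(element, method="c14n"), though that call alone does not strip insignificant whitespace.

```python
import copy
from collections import defaultdict
import xml.etree.ElementTree as ET  # stdlib; canonicalize() needs Python 3.8+


def canonical_key(element):
    """Return a canonical (C14N 2.0) string for one element.

    Attribute order and leading/trailing whitespace in text are normalized,
    so equivalent subtrees map to the same string.
    """
    clone = copy.copy(element)  # shallow copy so the original tree is untouched
    clone.tail = None           # tail text belongs to the parent, not this subtree
    raw = ET.tostring(clone, encoding="unicode")
    return ET.canonicalize(raw, strip_text=True)


def count_subtrees(xml_strings):
    """Count how often each canonical subtree string occurs across all inputs."""
    counts = defaultdict(int)
    for xml in xml_strings:
        root = ET.fromstring(xml)
        for sub_root in root.iter():
            counts[canonical_key(sub_root)] += 1
    return counts


# Hypothetical sample data: the same tree written three different ways.
samples = [
    '<a x="1" y="2"><b/></a>',
    '<a y="2"   x="1">\n  <b></b>\n</a>',
    '<a  y="2" x="1"> <b /> </a>',
]

for key, value in count_subtrees(samples).items():
    print('%s: %s' % (key, value))
```

The resulting keys are plain strings, so they can be used as dictionary keys or stored in SQL as in the question. Note that strip_text=True also trims whitespace around meaningful text, which is only safe if that whitespace does not matter in your data.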
