dpnishant

Reputation: 125

Python XML Processing with String Operations

My requirement is quite straightforward. I want to use an XML processing library in Python that lets me query the document to find specific nodes based on some logic (say, an attribute value check) and returns the exact string of the node as it appears in the input XML file, without any beautification, pretty-printing, implicit/explicit tag closing, attribute reordering, etc. The reason is that once I have the exact string of the node, I can "grep" the input XML file to locate its line number.

I tried xml.dom.minidom and xml.etree.ElementTree, but neither fits my requirements. To clarify, here is an example:

Input File

<?xml version="1.0" encoding="utf-8"?>
...snip...
<sometag attr2='1' attr1="3" />
...snip...

Python Code

Using minidom

from xml.dom.minidom import parse
dom = parse(path)
sometags = dom.getElementsByTagName('sometag')
for node in sometags:
    if int(node.getAttribute('attr1')) > 0:
        print(node.toxml())

Using ElementTree

import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
elem = tree.findall('.//sometag')[0]
str_elem = ET.tostring(elem, method='xml')  # also tried method='text' and method='html'; no luck

Both snippets generate the XML string rather than extracting it from the input file, so the result is not exactly the same string as in the file, and a string search in the file fails. By "exact string" I mean this: one of the libraries closes empty tags implicitly while the other closes them explicitly, regardless of which form the file uses, and the two libraries also differ in how they order the attributes (one sorts them alphabetically). Is there a way to get the original string of the node? Please help.

Upvotes: 0

Views: 241

Answers (1)

mhawke

Reputation: 87124

Using lxml, you can get the line number from the Element.sourceline attribute. That should suit you, because you then no longer need to grep the source XML file for it.

from lxml import etree
root = etree.parse('file.xml').getroot()
elem = root.findall('.//sometag')[0]

>>> print(root.sourceline)
2
>>> print(elem.sourceline)
3

If, for some reason, you can't use lxml, take a look at http://davejingtian.org/2014/07/04/python-hacking-make-elementtree-support-line-number/
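If you also want the tag text exactly as it appears in the file, one standard-library option is the low-level xml.parsers.expat parser, which reports the byte offset and line number at which each start tag begins. The following is only a minimal sketch, not the approach from the link above: find_start_tags, tag_end and the sample document are hypothetical names and data made up for illustration, and it assumes the file is UTF-8 encoded and small enough to read into memory as bytes.

```python
import xml.parsers.expat


def find_start_tags(data, tag, predicate=lambda attrs: True):
    """Return (line_number, exact_source_text) for each matching start tag.

    `data` must be the raw bytes of the XML document (assumed UTF-8 here).
    """
    matches = []
    parser = xml.parsers.expat.ParserCreate()

    def tag_end(start):
        # Scan from '<' to the closing '>', ignoring any '>' inside quotes.
        quote = None
        for i in range(start, len(data)):
            ch = data[i:i + 1]
            if quote:
                if ch == quote:
                    quote = None
            elif ch in (b'"', b"'"):
                quote = ch
            elif ch == b">":
                return i + 1
        return len(data)

    def on_start(name, attrs):
        if name == tag and predicate(attrs):
            # Inside a handler, these refer to the start of the current tag.
            begin = parser.CurrentByteIndex
            matches.append((parser.CurrentLineNumber,
                            data[begin:tag_end(begin)].decode("utf-8")))

    parser.StartElementHandler = on_start
    parser.Parse(data, True)
    return matches


# Hypothetical sample mirroring the input file in the question.
sample = (b'<?xml version="1.0" encoding="utf-8"?>\n'
          b'<root>\n'
          b'<sometag attr2=\'1\' attr1="3" />\n'
          b'</root>\n')

for line, text in find_start_tags(sample, "sometag",
                                  lambda a: int(a.get("attr1", 0)) > 0):
    print(line, text)
```

For the sample above this should report line 3 together with the tag exactly as written, including the original quote characters and attribute order. Because the text is sliced from the raw bytes rather than re-serialized, nothing is normalized; the trade-off is that you only get the start tag, not the element's children.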

Upvotes: 1
