XAnguera

Reputation: 1267

How to keep &amp; when parsing an XML file using lxml and XPath

I am trying to extract some information from an input XML file and write it to an output file using lxml and XPath expressions. I run into a problem when reading an XML element like the following:

...
<editor> Barnes &amp; Nobel </editor>
...

To parse the XML file and print the editor content I use the following (there is always exactly one editor element in the XML):

import io
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
docTree = etree.parse(io.BytesIO(open(inputXML, "rb").read()), parser)
print(docTree.xpath('//editor')[0].text)

My problem is that the &amp; gets converted at some point into '&', which messes up my further processing.

How can I ensure that the &amp; symbol will not be "decoded"?

Upvotes: 3

Views: 3336

Answers (2)

XAnguera

Reputation: 1267

I finally found the answer to my own question in the answer to "How do I escape ampersands in XML so they are rendered as entities in HTML?". In my code I have added an intermediate step that double-escapes every &amp; before parsing, so that it is still &amp; in the output. The code now reads:

import io
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
# Double-escape &amp; so that the parsed text still contains "&amp;"
xmlText = open(inputXML, "rb").read().replace(b"&amp;", b"&amp;amp;")
docTree = etree.parse(io.BytesIO(xmlText), parser)
print(docTree.xpath('//editor')[0].text)

In fact, just in case, I have applied the same recipe to the other predefined XML entities listed at http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined%5Fentities%5Fin%5FXML
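
Extending the recipe to all five predefined entities could look roughly like the sketch below. It is only an illustration under some assumptions: the helper name protect_entities and the sample document are made up here, and the standard library's xml.etree.ElementTree stands in for lxml so the snippet runs without extra dependencies. The pre-processing step itself is independent of the parser.

```python
import re
import xml.etree.ElementTree as ET

# Matches only the five predefined XML entities.
PREDEFINED_ENTITY = re.compile(br"&(amp|lt|gt|quot|apos);")


def protect_entities(xml_bytes):
    """Hypothetical helper: double-escape predefined entities in raw XML bytes.

    "&amp;" becomes "&amp;amp;", "&lt;" becomes "&amp;lt;", and so on, so the
    parser hands back the entity text itself instead of the decoded character.
    """
    return PREDEFINED_ENTITY.sub(br"&amp;\1;", xml_bytes)


# Made-up sample input for illustration.
raw = b"<book><editor> Barnes &amp; Nobel </editor><note>a &lt; b</note></book>"
root = ET.fromstring(protect_entities(raw))

editor_text = root.find("editor").text
note_text = root.find("note").text
print(editor_text)
print(note_text)
```

Doing the substitution in a single regular-expression pass avoids re-escaping text that an earlier replace call has already produced.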

Upvotes: 0

Ned Batchelder

Reputation: 375734

I know this will sound presumptuous, but you want the data to be "&". That is the text content of the XML element. If you have later processing that needs it as "&amp;", then you need a step that will XML- (or HTML-) encode it back to "&amp;".

You cannot ask an XML parser to parse your document and not turn "&amp;" into "&". It won't do it.
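
A minimal sketch of that re-encoding step, assuming the later processing only needs the XML-escaped form of the text. The sample document is made up, and the standard library's xml.etree.ElementTree is used here in place of lxml so the snippet has no external dependencies; xml.sax.saxutils.escape is a standard-library function that escapes &, < and >.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Made-up sample input for illustration.
xml_source = "<book><editor> Barnes &amp; Nobel </editor></book>"

root = ET.fromstring(xml_source)

# The parser decodes the entity: the element's text contains a plain "&".
editor_text = root.find(".//editor").text

# Re-encode only at the point where the escaped form is needed.
escaped_text = escape(editor_text)

print(editor_text)
print(escaped_text)
```

Keeping the decoded text internally and escaping only on output means the rest of the code never has to deal with double-escaped input.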

Upvotes: 1
