Reputation: 1267
I am trying to extract some information from an input XML file and write it to an output file using lxml and XPath expressions. I run into a problem when reading an XML element like the following:
...
<editor> Barnes &amp; Nobel </editor>
...
In order to parse the xml file and print the editor content I use (there is always only one editor in the xml):
import io
from lxml import etree
parser = etree.XMLParser(encoding='utf-8')
# Read the file as bytes, since BytesIO expects a byte string
docTree = etree.parse( io.BytesIO(open(inputXML, "rb").read()), parser )
print(docTree.xpath('//editor')[0].text)
My problem is that the &amp; gets converted at some point into '&', which messes up my further processing.
How can I ensure that the &amp; entity will not be "decoded"?
Upvotes: 3
Views: 3336
Reputation: 1267
I finally found the answer to my own question in the answer to "How do I escape ampersands in XML so they are rendered as entities in HTML?" In my code I have added an intermediate step, before parsing, to ensure that every &amp; stays the same in the output:
parser = etree.XMLParser(encoding='utf-8')
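# Escape the ampersand of each &amp; once more, so the parser decodes it back to the literal text "&amp;"
xmlText = open(inputXML, "rb").read().replace(b"&amp;", b"&amp;amp;")
docTree = etree.parse( io.BytesIO(xmlText), parser )
print(docTree.xpath('//editor')[0].text)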
In fact, just in case, I have applied the same recipe to other possible entities in XML as defined in http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined%5Fentities%5Fin%5FXML
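As a rough sketch of how that recipe might be extended to all five predefined XML entities, the snippet below escapes the leading ampersand of each entity reference before parsing. The helper name `protect_entities` and the sample document are made up for illustration, and the standard library's `xml.etree.ElementTree` is used here instead of lxml only so the example has no third-party dependency; lxml decodes entities the same way.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper: turn "&amp;" into "&amp;amp;", "&lt;" into "&amp;lt;", etc.,
# so that the parser's single decoding pass leaves the original entity as text.
PREDEFINED = re.compile(rb"&(amp|lt|gt|quot|apos);")

def protect_entities(xml_bytes):
    return PREDEFINED.sub(rb"&amp;\1;", xml_bytes)

# Sample input, assumed for illustration only.
raw = b"<book><editor> Barnes &amp; Nobel </editor></book>"

protected = protect_entities(raw)
editor_text = ET.fromstring(protected).find(".//editor").text
print(editor_text)
```

Note that this only protects the five predefined entities; numeric character references such as `&#38;` would still be decoded by the parser.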
Upvotes: 0
Reputation: 375734
I know this will sound presumptuous, but you want the data to be "&". That is the text content of the XML element. If you have later processing that needs it as "&amp;", then you need a step that will XML- (or HTML-) encode it back to "&amp;".
You cannot ask an XML parser to parse your document and not turn "&amp;" into "&". It won't do it.
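A minimal sketch of that re-encoding step, assuming the later processing only needs the text XML-escaped again: parse normally, then pass the decoded text through `xml.sax.saxutils.escape` from the standard library. The sample document is invented for illustration, and `xml.etree.ElementTree` stands in for lxml so the snippet runs without extra packages.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample input, assumed for illustration only.
xml_source = b"<book><editor> Barnes &amp; Nobel </editor></book>"

root = ET.fromstring(xml_source)

# The parser hands back the decoded text content of the element.
editor_text = root.find(".//editor").text

# Re-encode "&", "<" and ">" as entities for the downstream step.
escaped_text = escape(editor_text)

print(editor_text)
print(escaped_text)
```

This leaves the input file untouched and confines the escaping to the point where the escaped form is actually required.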
Upvotes: 1