Reputation: 105
I have a problem using lxml
I am using lxml to parse an XML file and then write it back out to a new XML file.
Input file:
<tag1>
<tag2 attr1="a1">&quot; example text &quot;</tag2>
<tag3>
<tag4 attr2="a2">&quot; example text &quot;</tag4>
<tag5>
<tag6 attr3="a3">&apos; example text &apos;</tag6>
</tag5>
</tag3>
</tag1>
Script:
from lxml import etree
parser = etree.XMLParser(remove_comments=False, strip_cdata=False, resolve_entities=False)
# Pass the parser explicitly; otherwise the options above are never used.
tree = etree.parse("input.xml", parser)
tree.write("out.xml")
Output:
<tag1>
<tag2 attr1="a1"> " example text " </tag2>
<tag3>
<tag4 attr2="a2"> " example text " </tag4>
<tag5>
<tag6 attr3="a3"> ' example text ' </tag6>
</tag5>
</tag3>
</tag1>
I want to retain &quot; and &apos;. I even tried using:
# tostring() with an encoding returns bytes, so open the file in binary mode.
with open('output.xml', 'wb') as f:
    f.write(etree.tostring(tree.getroot(), encoding="UTF-8", xml_declaration=False))
But neither attempt solved the problem.
Then I tried replacing " with &quot; manually:
root = tree.getroot()
for tag in root.iter():
    tag_text = tag.text
    if tag_text is not None:
        # Replace each literal double quote with the entity text.
        tag.text = tag_text.replace('"', "&quot;")
But this gave the following output:
<tag1>
<tag2 attr1="a1"> &amp;quot; example text &amp;quot; </tag2>
<tag3>
<tag4 attr2="a2"> &amp;quot; example text &amp;quot; </tag4>
<tag5>
<tag6 attr3="a3"> ' example text ' </tag6>
</tag5>
</tag3>
</tag1>
It replaces the & with &amp;. I am confused here. Please help me solve this.
Upvotes: 2
Views: 2505
Reputation: 14539
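&amp; is the XML encoding of the character &. &quot; is the XML encoding of the character ". The characters " and ' do not need to be encoded in element text, so lxml does not encode them. When you put the string &quot; into the tree yourself, lxml sees a literal & and encodes it as &amp;, which is why you get &amp;quot;.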
Have you tried decoding the document again? It should work just as you expect it to. If you need to encode the strings in the document again (turn & into &amp; etc.), do so with the individual strings in the lxml tree before generating the new XML document.
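If some downstream tool really does need the &quot; and &apos; spellings in the file, one possible workaround (not something lxml offers as an option, as far as I know) is to swap the quotes for placeholder characters before serializing and turn the placeholders into entities afterwards. This is only a sketch: the helper name and the two private-use sentinel characters are my own choices, and it assumes those characters never occur in your data.

```python
import copy
from lxml import etree

# Hypothetical sentinels from the Unicode private-use area (assumed unused in the data).
QUOT_MARK = "\ue000"
APOS_MARK = "\ue001"

def _mark(s):
    # Swap quote characters for sentinels; leave missing text (None) alone.
    if s is None:
        return None
    return s.replace('"', QUOT_MARK).replace("'", APOS_MARK)

def tostring_with_quote_entities(root):
    # Work on a copy so the caller's tree is not modified.
    root = copy.deepcopy(root)
    # Only element text and tail are touched, never attribute values.
    for el in root.iter(tag=etree.Element):
        el.text = _mark(el.text)
        el.tail = _mark(el.tail)
    # encoding="unicode" returns a str and keeps the sentinels as plain characters.
    xml = etree.tostring(root, encoding="unicode")
    return xml.replace(QUOT_MARK, "&quot;").replace(APOS_MARK, "&apos;")

src = ('<tag1><tag2 attr1="a1">&quot; example text &quot;</tag2>'
       '<tag6 attr3="a3">&apos; example text &apos;</tag6></tag1>')
out = tostring_with_quote_entities(etree.fromstring(src))
print(out)
```

Note that this encodes every quote in text content, including ones that were written literally in the input, because the parser does not remember which spelling was used.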
Upvotes: 1