radha shankar

Reputation: 105

Lxml : Ampersand in text

I have a problem using lxml. I am using it to parse an XML file and write it back to a new XML file.

Input file:

<tag1>
  <tag2 attr1="a1">&quot; example text &quot;</tag2>
  <tag3>
    <tag4 attr2="a2">&quot; example text &quot;</tag4>
    <tag5>
      <tag6 attr3="a3">&apos; example text &apos;</tag6>
    </tag5>
  </tag3>
</tag1>

Script:

    from lxml import etree

    # Pass the parser to parse(); otherwise its options are never used.
    parser = etree.XMLParser(remove_comments=False, strip_cdata=False, resolve_entities=False)
    tree = etree.parse("input.xml", parser)
    tree.write("out.xml")

Output:

<tag1>
  <tag2 attr1="a1"> " example text "  </tag2>
  <tag3>
    <tag4 attr2="a2"> " example text " </tag4>
    <tag5>
      <tag6 attr3="a3"> ' example text ' </tag6>
    </tag5>
  </tag3>
</tag1>

I want to retain &quot; and &apos;. I also tried:

    # tostring() returns bytes when an encoding is given, so write in binary mode.
    with open("output.xml", "wb") as f:
        f.write(etree.tostring(tree.getroot(), encoding="UTF-8", xml_declaration=False))

But neither approach solved the problem.

Then I tried replacing " with &quot; manually:

    root = tree.getroot()
    for tag in root.iter():
        # Elements without text content have tag.text set to None.
        if tag.text is not None:
            tag.text = tag.text.replace('"', "&quot;")

But this gave the following output:

<tag1>
  <tag2 attr1="a1"> &amp;quot; example text &amp;quot;  </tag2>
  <tag3>
    <tag4 attr2="a2"> &amp;quot; example text &amp;quot; </tag4>
    <tag5>
      <tag6 attr3="a3"> &apos; example text &apos; </tag6>
    </tag5>
  </tag3>
</tag1>

It replaces the & with &amp;. I am confused here. Please help me solve this.

Upvotes: 2

Views: 2505

Answers (1)

Drathier

Reputation: 14539

&amp; is the XML encoding of the character &, and &quot; is the XML encoding of the character ". In element text, the characters " and ' do not need to be encoded, so lxml does not encode them.

Have you tried parsing the output document again? It should work just as you expect, because a parser decodes &quot; and a literal " to the same character. If you need to encode a string in the document again (turn & into &amp;, etc.), do so on the individual strings in the lxml tree before generating the new XML document.
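
To make this concrete, here is a minimal sketch, assuming a cut-down version of the input from the question held in an in-memory byte string (the variable names are only for illustration). It checks that the text survives a parse/serialize round trip unchanged, and it reproduces the &amp;quot; result from the manual replacement:

```python
from lxml import etree

# Cut-down version of the question's input (illustrative, not the full file).
source = b'<tag1><tag2 attr1="a1">&quot; example text &quot;</tag2></tag1>'

root = etree.fromstring(source)
tag2 = root.find("tag2")

# The parser has already decoded &quot; into a plain quote character.
parsed_text = tag2.text

# On output, lxml leaves quotes in element text unescaped.
serialized = etree.tostring(root)

# Parsing the serialized document again yields exactly the same text.
reparsed_text = etree.fromstring(serialized).find("tag2").text

# Assigning the literal string "&quot;" puts a real "&" into the text,
# and the serializer must escape that "&" as "&amp;".
tag2.text = "&quot; example text &quot;"
double_escaped = etree.tostring(root)

print(parsed_text == reparsed_text)
print(double_escaped)
```

So the input and output files describe the same document even though the bytes differ, and any XML parser reading out.xml sees the same text as one reading input.xml.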

Upvotes: 1
