Reputation: 61
data.xml
<?xml version="1.0" encoding="UTF-8"?>
<ArticleSet>
<Article>
<LastName>Bojarski</LastName>
<ForeName>-</ForeName>
<Affiliation>-</Affiliation>
</Article>
<Article>
<LastName>Genç</LastName>
<ForeName>Yasemin</ForeName>
<Affiliation>fgjfgnfgn</Affiliation>
</Article>
</ArticleSet>
SAMPLE CODE
from lxml import etree
dom = etree.parse('data.xml')
root = dom.getroot()
for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)
dom.write('output.xml')
This code deletes every article whose Affiliation is equal to -, i.e. whose affiliation tag looks like <Affiliation>-</Affiliation>.
When I write the remaining articles to output.xml, the non-ASCII character in Genç is replaced by a character reference, so the name becomes Gen&#231;
I want to store it as it is.
Code's output
<ArticleSet>
<Article>
<LastName>Gen&#231;</LastName>
<ForeName>Yasemin</ForeName>
<Affiliation>fgjfgnfgn</Affiliation>
</Article>
</ArticleSet>
Required output
<ArticleSet>
<Article>
<LastName>Genç</LastName>
<ForeName>Yasemin</ForeName>
<Affiliation>fgjfgnfgn</Affiliation>
</Article>
</ArticleSet>
Upvotes: 3
Views: 5164
Reputation: 1471
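By default, write serializes the tree as ASCII, so any non-ASCII character is written as a numeric character reference. The write method of the tree returned by etree.parse takes an encoding parameter that changes this. You may also pass xml_declaration=True to declare the encoding in the output document.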
dom.write('output.xml', encoding='utf-8', xml_declaration=True)
See lxml documentation.
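To check the difference without touching any files, here is a minimal sketch that serializes the same tree twice into memory. The `serialize` helper and the inline `XML` sample are hypothetical names introduced for illustration, and the fallback import is an assumption made so the sketch also runs where lxml is not installed (the standard library's ElementTree accepts the same `write` keywords).

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, whose write() takes
    # the same encoding / xml_declaration keywords.
    import xml.etree.ElementTree as etree

# Hypothetical sample, reduced from data.xml in the question.
XML = "<ArticleSet><Article><LastName>Genç</LastName></Article></ArticleSet>".encode("utf-8")


def serialize(tree, **kwargs):
    """Write the tree into an in-memory buffer and return the raw bytes."""
    buf = io.BytesIO()
    tree.write(buf, **kwargs)
    return buf.getvalue()


tree = etree.parse(io.BytesIO(XML))

# No encoding given: ASCII output, non-ASCII text becomes a character reference.
default_bytes = serialize(tree)

# Explicit UTF-8 plus an XML declaration: the character is stored as-is.
utf8_bytes = serialize(tree, encoding="utf-8", xml_declaration=True)

print(default_bytes)
print(utf8_bytes.decode("utf-8"))
```

Both outputs describe the same document to an XML parser; the encoding only changes how the character is stored on disk.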
Upvotes: 7