vinay nischal

Reputation: 61

How to make lxml output file with utf-8 encoding

data.xml

<?xml version="1.0" encoding="UTF-8"?>
<ArticleSet>
    <Article>            
        <LastName>Bojarski</LastName>
        <ForeName>-</ForeName>
        <Affiliation>-</Affiliation>            
    </Article>
    <Article>            
        <LastName>Genç</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

SAMPLE CODE

from lxml import etree

dom = etree.parse('data.xml')
root = dom.getroot()

for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)

dom.write('output.xml')

This code removes every article whose Affiliation is equal to -, i.e. whose affiliation tag looks like <Affiliation>-</Affiliation>. When I write the remaining articles to output.xml, the non-ASCII character in Genç is escaped, so the name is stored as Gen&#231;. I want it stored as it is.

Code's output

<ArticleSet>
    <Article>            
        <LastName>Gen&#231;</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

Required output

<ArticleSet>
    <Article>            
        <LastName>Genç</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

Upvotes: 3

Views: 5164

Answers (1)

Sergey Belash

Reputation: 1471

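The write method of the tree returned by etree.parse takes an encoding parameter. Without it, lxml serializes as ASCII, which is why ç is escaped as &#231;. You may also pass xml_declaration=True so that the output document declares its encoding.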

dom.write('output.xml', encoding='utf-8', xml_declaration=True)

See lxml documentation.
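To see the difference without touching the file system, here is a minimal sketch that runs the same removal on an in-memory copy of the data and writes it twice, once with the default settings and once with encoding='utf-8'. The XML_BYTES constant and the serialize helper are names made up for this illustration; the sample is a shortened version of data.xml from the question.

```python
import io

from lxml import etree

# Shortened in-memory stand-in for data.xml (hypothetical constant).
XML_BYTES = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<ArticleSet>'
    '<Article><LastName>Bojarski</LastName><Affiliation>-</Affiliation></Article>'
    '<Article><LastName>Genç</LastName><Affiliation>fgjfgnfgn</Affiliation></Article>'
    '</ArticleSet>'
).encode('utf-8')


def serialize(tree, **write_kwargs):
    """Hypothetical helper: write the tree to a bytes buffer and return the bytes."""
    buf = io.BytesIO()
    tree.write(buf, **write_kwargs)
    return buf.getvalue()


dom = etree.parse(io.BytesIO(XML_BYTES))
root = dom.getroot()

# Same removal as in the question.
for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)

# No encoding given: non-ASCII characters become character references.
default_bytes = serialize(dom)

# Explicit UTF-8: characters are written as they are, with an XML declaration.
utf8_bytes = serialize(dom, encoding='utf-8', xml_declaration=True)

print(default_bytes.decode('ascii'))
print(utf8_bytes.decode('utf-8'))
```

The first print shows Gen&#231; and the second shows Genç. Passing a file name instead of the buffer, as in dom.write('output.xml', encoding='utf-8', xml_declaration=True), behaves the same way.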

Upvotes: 7
