Shankar Guru

Reputation: 1161

Retain namespace prefix in a tag when parsing xml using lxml

I have the XML below. A few tags are prefixed with ce, for example <ce:title>. When I run the code below, which uses XPath, <ce:title> is replaced with <title> in the output. I have seen other questions on SO, such as "How to preserve namespace information when parsing HTML with lxml?", but I am not sure where and how to add the namespace details.

Can someone please suggest how I can retain <ce:title> for the XML below?

from lxml import html
from lxml.etree import tostring
with open('102277033304.xml', encoding='utf-8') as file_object:
    xml = file_object.read().strip()
    root = html.fromstring(xml)
    for element in root.xpath('//item/book/pages/*'):
        # separate name so the imported html module is not shadowed
        serialized = tostring(element, encoding='utf-8')
        print(serialized)

XML:

<item>
    <book>
        <pages>
            <page-info>
                <page>
                  <ce:title>Chapter 1</ce:title>
                  <content>Welcome to Chapter 1</content>
                </page>
                <page>
                 <ce:title>Chapter 2</ce:title>
                 <content>Welcome to Chapter 2</content>
                </page>
            </page-info>
            <page-fulltext>Published in page 1</page-fulltext>
            <page-info>
                <page>
                  <ce:title>Chapter 1</ce:title>
                  <content>Welcome to Chapter 1</content>
                </page>
                <page>
                 <ce:title>Chapter 2</ce:title>
                 <content>Welcome to Chapter 2</content>
                </page>
            </page-info>
            <page-fulltext>Published in page 2</page-fulltext>
            <page-info>
                <page>
                  <ce:title>Chapter 1</ce:title>
                  <content>Welcome to Chapter 1</content>
                </page>
                <page>
                 <ce:title>Chapter 2</ce:title>
                 <content>Welcome to Chapter 2</content>
                </page>
            </page-info>
            <page-fulltext>Published in page 3</page-fulltext>
        </pages>
    </book>
</item>

Upvotes: 1

Views: 474

Answers (1)

Jack Fleeting

Reputation: 24930

That's most likely because you are reading XML with an HTML parser (lxml.html), which does not keep namespace prefixes such as ce.

Try it like this:

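from lxml import etree

# xml is the string read from the file, as in the question
root = etree.XML(xml)
for element in root.xpath('//item/book/pages/*'):
    # separate name so the input string xml is not overwritten
    serialized = etree.tostring(element, encoding='utf-8')
    print(serialized)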

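This should give you the expected output, with <ce:title> kept as is, provided the ce prefix is declared somewhere in the file (for example on the root element). By default the XML parser raises an XMLSyntaxError for a prefix that is not declared.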
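
For reference, here is a minimal self-contained sketch of the same idea. It assumes the ce prefix is declared on the root element; the URI http://example.com/ce is a placeholder I made up, not the real namespace of your file, and serialize_pages is just an illustrative helper name.

```python
from lxml import etree

# Assumption: the real file declares the "ce" prefix somewhere. The URI
# below is a placeholder for illustration, not the actual namespace.
SAMPLE = b"""<item xmlns:ce="http://example.com/ce">
    <book>
        <pages>
            <page-info>
                <page>
                    <ce:title>Chapter 1</ce:title>
                    <content>Welcome to Chapter 1</content>
                </page>
            </page-info>
            <page-fulltext>Published in page 1</page-fulltext>
        </pages>
    </book>
</item>"""


def serialize_pages(xml_bytes):
    """Parse with the XML parser and serialize each child of <pages>."""
    root = etree.fromstring(xml_bytes)
    return [
        etree.tostring(element, encoding="utf-8")
        for element in root.xpath("//item/book/pages/*")
    ]


if __name__ == "__main__":
    for chunk in serialize_pages(SAMPLE):
        print(chunk.decode("utf-8"))
```

Each serialized fragment keeps the ce: prefix on the title elements. Note that lxml also writes the xmlns:ce declaration on the top element of each fragment so that the fragment is well-formed on its own.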

Upvotes: 1
