Retain namespace prefix in a tag when parsing xml using lxml

Question

I have an XML file as shown below. A few tags are prefixed with ce, for example <ce:title>. When I run the code below with XPath, the ce: prefix is missing from <ce:title> in the output. I did see other questions on SO, such as "How to preserve namespace information when parsing HTML with lxml?" (https://stackoverflow.com/questions/6597271/how-to-preserve-namespace-information-when-parsing-html-with-lxml), but I am not sure where and how to add the namespace details. Can someone please suggest how I can retain <ce:title> for the XML below?

Code:

from lxml import html
from lxml.etree import tostring

with open('102277033304.xml', encoding='utf-8') as file_object:
    xml = file_object.read().strip()
    root = html.fromstring(xml)

for element in root.xpath('//item/book/pages/*'):
    html = tostring(element, encoding='utf-8')
    print(html)

XML:

<item>
  <book>
    <pages>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 1</page-fulltext>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 2</page-fulltext>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 3</page-fulltext>
    </pages>
  </book>
</item>

Jack Fleeting · Accepted Answer

That's probably because you are using an HTML parser (lxml.html) to read XML. HTML has no concept of XML namespaces, so the HTML parser does not keep the ce: prefix.
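A quick way to see the difference is to serialize the same snippet with both parsers. This is only a minimal sketch: the `xmlns:ce` URI is a placeholder I made up, and the exact markup that `lxml.html` emits may depend on your lxml/libxml2 version, so print it and compare rather than relying on a specific result.

```python
from lxml import etree, html

# Hypothetical snippet; the namespace URI is a placeholder, not from the question.
snippet = (
    '<page xmlns:ce="http://example.com/ce">'
    "<ce:title>Chapter 1</ce:title>"
    "</page>"
)

# HTML parser: tolerant of odd markup, but not namespace-aware.
as_html = etree.tostring(html.fromstring(snippet), encoding="unicode")

# XML parser: namespace-aware, so the ce: prefix survives serialization.
as_xml = etree.tostring(etree.XML(snippet), encoding="unicode")

print("html parser:", as_html)
print("xml parser: ", as_xml)
```

The XML result still contains `<ce:title>Chapter 1</ce:title>`; the HTML result is whatever your installed parser makes of a tag name it does not know.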

Try it like this:

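from lxml import etree

# xml is the string read from the file in the question
root = etree.XML(xml)
for element in root.xpath('//item/book/pages/*'):
    # use a new name so the input string xml is not overwritten
    fragment = etree.tostring(element, encoding='utf-8')
    print(fragment)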

This should give you the expected output.
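One caveat worth checking: the XML parser only accepts `ce:` if that prefix is declared somewhere in the document. The snippet as posted has no `xmlns:ce` declaration, so the sketch below assumes your full file declares it on an ancestor element. The URI and the helper name `serialize_pages` are placeholders of mine, not taken from the question.

```python
from lxml import etree

# Trimmed copy of the question's data, with no namespace declaration.
UNDECLARED = (
    "<item><book><pages><page-info><page>"
    "<ce:title>Chapter 1</ce:title>"
    "<content>Welcome to Chapter 1</content>"
    "</page></page-info></pages></book></item>"
)

# Same data with the prefix declared on the root (placeholder URI, an assumption).
DECLARED = UNDECLARED.replace(
    "<item>", '<item xmlns:ce="http://example.com/ce">', 1
)


def serialize_pages(xml_text):
    """Return each child of <pages> serialized as a str."""
    root = etree.XML(xml_text)
    return [
        etree.tostring(element, encoding="unicode", with_tail=False)
        for element in root.xpath("//item/book/pages/*")
    ]


# Without a declaration the XML parser refuses the document.
try:
    serialize_pages(UNDECLARED)
    undeclared_ok = True
except etree.XMLSyntaxError as exc:
    undeclared_ok = False
    print("undeclared prefix rejected:", exc)

# With the declaration in place, the prefix is retained in the output.
fragments = serialize_pages(DECLARED)
print(fragments[0])
```

Note that `encoding='utf-8'` makes `tostring` return bytes, which is why the printed output starts with `b'...'`; `encoding="unicode"` returns a regular str instead.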
