wolfgang

Reputation: 7789

How to parse XML without DTD validation using lxml?

I have the following XML, which is invalid (DTD/XML):

<city>
<address>
      <zipcode>4455</zipcode>
</address>

I'm trying to parse it with lxml like this:

from lxml import etree as ET

parser = ET.XMLParser(dtd_validation=False)
tree = ET.fromstring(xml_data, parser)
print(tree.xpath('//zipcode'))

Unfortunately, this code still raises XML errors.

Any idea how I can get a non-validating parse of the above XML?

Upvotes: 0

Views: 1355

Answers (1)

har07

Reputation: 89295

Assuming that by 'invalid dtd' you mean that the <city> tag is not closed in the XML sample above, the problem has nothing to do with DTD validation: the document is not well-formed, so strictly speaking it isn't XML at all because it doesn't follow the XML rules.
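To illustrate the point, here is a minimal sketch that feeds the sample from the question to a parser with dtd_validation=False. The xml_data string is my reconstruction of the question's input, not something the asker posted as a variable. The parse should still fail with a syntax error, because well-formedness is checked regardless of DTD validation:

```python
from lxml import etree as ET

# Assumed input: the sample from the question, with <city> left unclosed.
xml_data = """<city>
<address>
      <zipcode>4455</zipcode>
</address>
"""

# Turning DTD validation off does not relax the well-formedness check.
parser = ET.XMLParser(dtd_validation=False)

error = None
try:
    ET.fromstring(xml_data, parser)
except ET.XMLSyntaxError as exc:
    # The unclosed <city> element makes the parser give up here.
    error = exc
    print("not well-formed:", exc)
```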

You need to fix the document somehow before it can be treated as XML. For this simple unclosed-tag case, setting recover=True will do the job:

from lxml import etree as ET

parser = ET.XMLParser(recover=True)
tree = ET.fromstring(xml_data, parser)
print(tree.xpath('//zipcode'))
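For completeness, a self-contained sketch of the same idea, assuming xml_data holds the sample from the question (that string is my reconstruction). It also prints the parser's error_log, which can help you see what the recovering parser had to work around. Recovery is best-effort, so on more badly damaged input the resulting tree may not match what you expect:

```python
from lxml import etree as ET

# Assumed input: the sample from the question, with <city> left unclosed.
xml_data = """<city>
<address>
      <zipcode>4455</zipcode>
</address>
"""

# recover=True asks the parser to keep going past syntax errors
# and build a tree from whatever it could make sense of.
parser = ET.XMLParser(recover=True)
root = ET.fromstring(xml_data, parser)

# Collect the text of every <zipcode> element in the recovered tree.
zipcodes = [el.text for el in root.xpath('//zipcode')]
print(root.tag, zipcodes)

# Problems the parser recovered from, if any were recorded.
for entry in parser.error_log:
    print(entry.line, entry.message)
```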

Upvotes: 2
