labrassbandito

Reputation: 535

Python: Avoid DTD validation with LXML

I am parsing USPTO patents from 2001 in SGML format. At the top of each file, an external DTD is referenced. Unfortunately, this DTD seems to be invalid, as a validity check confirms:

Line 361
Error: A '(' character or an element type is required within declaration of element type "ADR".
<!ELEMENT ADR  - - (OMC?,STR*,CITY?,CNTY?,STATE?,CTRY?,PCODE?,EAD*,TEL*,FAX* ...

However, I do not need to validate the SGML files to be processed. I just need the SGML parser to be aware of the entities. Currently, I am using Python with the LXML library. I call the XMLParser as follows:

parser = etree.XMLParser(target=SimpleXMLHandler(), resolve_entities=False, load_dtd=dtd, dtd_validation=False, recover=True)  

But I still immediately get the error that the external DTD is invalid at line 361. How can I avoid this? I did not write the DTD, so I would rather not repair it.

Regards!

Upvotes: 3

Views: 1066

Answers (1)

Steven

Reputation: 28666

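As Chrono Kitsune already noted, the problem is XML versus SGML: the DTD is not a valid XML DTD because it is an SGML DTD. The "- -" after the element name in the ADR declaration is SGML tag-minimization syntax, which XML DTDs do not allow, so an XML parser expects "(" or a content type there and reports an error.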

I'd suggest converting the SGML documents to XML first, for example using sx.
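
A minimal sketch of that conversion step driven from Python, under some assumptions: the converter is installed and on PATH (OpenSP ships it as osx, the original SP package as sx), it can locate the DTD referenced by each document, and it writes the XML to standard output. The function names build_command and sgml_to_xml are made up for this illustration.

```python
import shutil
import subprocess


def build_command(sgml_path, tool="osx"):
    """Return the argument list for the SGML-to-XML converter.

    Assumption: the tool takes the input file as its only argument
    and writes the converted XML to stdout.
    """
    return [tool, str(sgml_path)]


def sgml_to_xml(sgml_path, xml_path, tool="osx"):
    """Convert one SGML file to XML (hypothetical helper).

    Returns the converter's exit code and whatever it printed to stderr.
    """
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool!r} was not found on PATH")
    # Redirect the converter's stdout into the target XML file.
    with open(xml_path, "wb") as out:
        result = subprocess.run(
            build_command(sgml_path, tool),
            stdout=out,
            stderr=subprocess.PIPE,
        )
    return result.returncode, result.stderr.decode(errors="replace")
```

The converter may print warnings and exit with a non-zero code even when it produces usable output, so the sketch returns both values instead of raising. The resulting XML file can then be passed to the lxml parser from the question.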

Upvotes: 5
