Reputation: 535
I am parsing USPTO patents from 2001 in SGML format. At the top of each file, an external DTD is referenced. Unfortunately, this DTD seems to be invalid, which a validity check confirms:
Line 361
Error: A '(' character or an element type is required within declaration of element type "ADR".
<!ELEMENT ADR - - (OMC?,STR*,CITY?,CNTY?,STATE?,CTRY?,PCODE?,EAD*,TEL*,FAX* ...
However, I do not need to validate the SGML files I am processing. I just need the parser to be aware of the entities the DTD declares. I am currently using Python with the lxml library, and I create the XMLParser as follows:
parser = etree.XMLParser(target=SimpleXMLHandler(), resolve_entities=False, load_dtd=dtd, dtd_validation=False, recover=True)
Still, I immediately get the error that the external DTD is invalid at line 361. How can I avoid this? I am not the author of the DTD, so I do not want to repair it.
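For context, here is a minimal sketch of the setup described above. The body of SimpleXMLHandler and the helper build_parser are hypothetical stand-ins rather than my actual code. The sketch only assumes lxml's parser target interface (start, end, data, close) and treats load_dtd as a boolean flag, which is how lxml documents it.

```python
class SimpleXMLHandler:
    """Hypothetical parser target that records the events it receives."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        # Called for every opening tag
        self.events.append(("start", tag))

    def end(self, tag):
        # Called for every closing tag
        self.events.append(("end", tag))

    def data(self, text):
        # Called for character data between tags
        self.events.append(("data", text))

    def close(self):
        # Called once at the end of parsing; hand back the events and reset
        events, self.events = self.events, []
        return events


def build_parser(load_dtd=True):
    # Hypothetical helper. lxml is imported here so that the handler
    # above can be exercised on its own without lxml installed.
    from lxml import etree

    return etree.XMLParser(
        target=SimpleXMLHandler(),
        resolve_entities=False,
        load_dtd=load_dtd,      # boolean: whether to read the referenced DTD
        dtd_validation=False,   # do not validate against the DTD
        recover=True,           # try to continue past errors
    )
```

With lxml installed, etree.fromstring(data, build_parser()) should return whatever close() returns, where data is a placeholder for the file contents. As far as I understand, passing load_dtd=False keeps lxml from reading the external DTD at all, but then the entities it declares are unknown to the parser, which is exactly what I am trying to avoid.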
Regards!
Upvotes: 3
Views: 1066