Reputation: 809
I am trying to parse an XML document with a SAX parser, but I keep getting the error "XML document structures must start and end within the same entity."
This is expected, because the XML I receive from another source is not always complete. I don't want this exception to be raised, though: I only want to parse the document until I find <myTag>,
and I don't care whether the document has proper opening and closing tags.
Example:
<employeeDetails>
<firstName>xyz</firsName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>...
Here I don't care whether the document is valid, as that part is out of my hands. I would like to parse this document until I see <departName>
and stop parsing after that. Please suggest how to do this. Thanks.
Upvotes: 0
Views: 1574
Reputation: 111686
You cannot use an XML parser to parse a file that does not contain well-formed XML. (It does not have to be valid, just well-formed. For the difference, read Well-formed vs Valid XML.)
By definition, XML must be well-formed, otherwise it is not XML. Parsers in general have to have some fundamental constraints met in order to operate, and for XML parsers, it is well-formedness.
Either repair the file manually first to be well-formed XML, or open it programmatically and parse it as a text file using traditional parsing techniques. An XML parser cannot help you unless you have well-formed XML.
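As a rough illustration of the "parse it as a text file" route, here is a minimal Python sketch that scans the raw text for the first occurrence of a tag with a regular expression. The sample string and the helper name first_tag_text are made up for this example, and the approach assumes the target element is simple: no nested element of the same name, no CDATA sections, no comments containing the tag.

```python
import re

# Hypothetical sample: a truncated document like the one in the question.
truncated = """<employeeDetails>
<firstName>xyz</firstName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>"""


def first_tag_text(text, tag):
    """Return the text of the first <tag>...</tag> pair, or None if absent.

    Plain text matching only: no XML parser is involved, so the rest of
    the document does not need to be well-formed.
    """
    name = re.escape(tag)
    # Allow optional attributes on the opening tag; match lazily up to
    # the first closing tag of the same name.
    pattern = re.compile(r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(name), re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else None


depart_name = first_tag_text(truncated, "departName")
```

Entity references such as &amp; are returned undecoded, so this is only suitable when you need a quick value out of a known, simple element.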
Upvotes: 2
Reputation: 1
BeautifulSoup in Python handles incomplete XML really well. I use it to parse the prefix of large XML files for preview. (The 'xml' parser used below requires lxml to be installed.)
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a><b>foo</b><b>bar<','xml')
<?xml version="1.0" encoding="unicode-escape"?>\n<a><b>foo</b><b>bar</b></a>
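If installing BeautifulSoup is not an option, a similar prefix-only read can be sketched with the standard library's xml.etree.ElementTree.XMLPullParser. This is a different technique from the one above: it does not repair anything, it just stops feeding data once the wanted element has been seen, so the missing closing tags at the end are never reached. It only works if everything before the target tag is well-formed; an earlier error (such as a mismatched closing tag) still surfaces as a ParseError. The sample data and the helper name text_of_first are assumptions made for this example.

```python
from xml.etree.ElementTree import XMLPullParser

# Hypothetical sample: well-formed up to the tag of interest, then cut off.
truncated = """<employeeDetails>
<firstName>xyz</firstName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>"""


def text_of_first(chunks, tag):
    """Feed chunks to a pull parser and return the text of the first
    element named `tag`, or None if the input ends before it appears."""
    parser = XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                # Stop here; the rest of the input is never parsed and
                # parser.close() is deliberately not called.
                return elem.text
    return None


# Simulate reading the file in small pieces.
chunks = [truncated[i:i + 16] for i in range(0, len(truncated), 16)]
depart_name = text_of_first(chunks, "departName")
```

In real use the chunks would come from reading the file or stream in blocks, which also avoids loading a large document into memory.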
Upvotes: 0