Reputation: 2837
I've got some badly formed XML files that I'm processing with Python, and I need to figure out what's wrong with them (i.e. what the errors are) without actually looking at the data (the files contain sensitive client data).
I figure there should be a way to sanitize the XML (i.e. remove all content in all nodes) but keep the tags, so that I can see any structural issues.
However, ElementTree doesn't return any detailed information about mismatched tags - just a line number and a character position, which is useless if I can't reference the original XML.
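For reference, a minimal sketch of what the parser exposes when it fails: `ParseError` carries only an expat error code and a (line, column) position, and the code maps to a generic message. The helper name `describe_parse_error` and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import ErrorString


def describe_parse_error(xml_text):
    """Return (message, line, column) for the first well-formedness
    error in xml_text, or None if the text parses cleanly."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position      # position reported by expat
        message = ErrorString(err.code)  # generic text, no document content
        return message, line, column
    return None


# Hypothetical sample with a mismatched closing tag.
sample = "<a><b>secret</c></a>"
print(describe_parse_error(sample))
```

The message is generic (it names the kind of error, not the tags involved), which is why the position alone doesn't help without the original file.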
Does anyone know how I can either sanitize the XML so I can view it, or get more detailed error messages for badly formed XML (that won't return tag contents)? I could write a custom parser to strip content, but I wanted to exhaust other options first.
Upvotes: 0
Views: 598
Reputation: 111686
It's a hard enough problem to try to automatically fix markup problems when you can look at the file. If you're not permitted to see the document contents, forget about having any reasonable hope of fixing such doubly undefined problems.
Your best bet is to fix the bad "XML" at its source.
If you can't do that, I suggest that you use a tool listed in "How to parse invalid (bad / not well-formed) XML?" to attempt to automatically repair the well-formedness problem. Then, after you actually have XML, you can use XML tools to strip or sanitize content (if that's even still necessary at that point).
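Once a repair step has produced well-formed XML, the stripping step could look roughly like the following. This is a minimal sketch using only the standard library's `xml.etree.ElementTree`; the function name `strip_content`, its `placeholder` parameter, and the sample document are assumptions made for illustration, and it only works on input that already parses.

```python
import xml.etree.ElementTree as ET


def strip_content(xml_text, placeholder=""):
    """Return xml_text with element text, tail text, and attribute
    values replaced by placeholder; tag and attribute names are kept.
    Assumes xml_text is already well-formed."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Replace character data inside and after each element,
        # leaving whitespace-only text (indentation) alone.
        if elem.text and elem.text.strip():
            elem.text = placeholder
        if elem.tail and elem.tail.strip():
            elem.tail = placeholder
        # Attribute values can carry sensitive data too.
        for name in list(elem.attrib):
            elem.attrib[name] = placeholder
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample document.
print(strip_content('<client id="7"><name>secret</name>note</client>'))
```

Note that tag and attribute names survive, so this only helps if the names themselves are not sensitive. The default parser also drops comments, so their contents do not reach the output.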
Upvotes: 1