Reputation: 4292
A server I can't influence sends very broken XML.
Specifically, a Unicode WHITE STAR (U+2606) gets encoded as UTF-8 (E2 98 86) and then translated using a Latin-1 to HTML entity table. What I get is &acirc;\x98\x86 (9 bytes) in a file that's declared as UTF-8 with no DTD.
I couldn't configure W3C Tidy in a way that doesn't garble this irreversibly; I only found out how to make lxml skip it silently. SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.
What else is there?
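For reference, the corruption described above can be reproduced in a few lines (a Python 3 sketch; the stdlib table `html.entities.codepoint2name` stands in for the server's Latin-1-to-entity mapping):

```python
# Reproduce the corruption: WHITE STAR is encoded as UTF-8, then each
# byte is treated as a Latin-1 character and looked up in an entity table.
from html.entities import codepoint2name

star_utf8 = "\u2606".encode("utf-8")   # WHITE STAR as UTF-8: b'\xe2\x98\x86'

# 0xE2 has a named entity (&acirc;); 0x98 and 0x86 are C1 control
# codes with no entity, so they pass through untouched.
corrupted = "".join(
    "&%s;" % codepoint2name[b] if b in codepoint2name else chr(b)
    for b in star_utf8
)

print(repr(corrupted))  # '&acirc;\x98\x86' -- the 9 bytes from the question
```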
Upvotes: 4
Views: 2137
Reputation: 2167
Maybe something like:
import re
import htmlentitydefs as ents  # Python 3: from html.entities import entitydefs
from lxml import etree  # or lxml.html, if the input is even more broken

def repl_ent(m):
    # strip the leading '&' and trailing ';' and look up the bare entity name
    return ents.entitydefs[m.group()[1:-1]]

goodxml = re.sub(r'&\w+;', repl_ent, badxml)
etree.fromstring(goodxml)
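A Python 3 version of the same idea, run against a small hypothetical input: `entitydefs` now lives in `html.entities`, and the stdlib `ElementTree` stands in here for `lxml.etree`, which accepts the same repaired string. The extra Latin-1 round trip restores the original UTF-8 bytes that the entity encoding split apart:

```python
import re
from html.entities import entitydefs   # Python 3 home of entitydefs
import xml.etree.ElementTree as etree  # stand-in for lxml.etree in this sketch

# Hypothetical sample input: the question's WHITE STAR, mangled as described.
badxml = "<doc>&acirc;\x98\x86</doc>"

# Swap each named entity back to its Latin-1 character, then re-decode the
# whole string as UTF-8 to reassemble the original code points.
goodxml = re.sub(r"&\w+;", lambda m: entitydefs[m.group()[1:-1]], badxml)
goodxml = goodxml.encode("latin-1").decode("utf-8")

root = etree.fromstring(goodxml)
print(root.text)  # the WHITE STAR, U+2606
```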
Upvotes: 0
Reputation: 74765
BeautifulSoup is your best bet in this case. I suggest profiling before ruling out BeautifulSoup altogether.
Upvotes: 2