Tobias

Reputation: 4292

How to parse broken XML in Python?

A server I can't influence sends very broken XML.

Specifically, a Unicode WHITE STAR gets encoded as UTF-8 (E2 98 86) and then run through a Latin-1 to HTML entity table. What I get is &acirc; followed by the raw bytes 98 86 (9 bytes in total) in a file that's declared as UTF-8 and has no DTD.
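
To make the mangling concrete, here is a minimal sketch that reproduces it for one character. The name `mangle` is made up for illustration, and it assumes the server only rewrites bytes 0xA0 to 0xFF (the ones that have a name in the HTML 4 Latin-1 entity table) and passes every other byte through untouched:

```python
from html.entities import codepoint2name


def mangle(text):
    """Encode text as UTF-8, then replace each high byte that has a
    Latin-1 entity name with that entity (assumed server behaviour)."""
    out = bytearray()
    for byte in text.encode("utf-8"):
        # Only bytes >= 0x80 are considered; 0x80-0x9F have no entity name.
        name = codepoint2name.get(byte) if byte >= 0x80 else None
        if name is not None:
            out += b"&" + name.encode("ascii") + b";"
        else:
            out.append(byte)
    return bytes(out)


broken = mangle("\u2606")  # U+2606 WHITE STAR
print(broken)       # b'&acirc;\x98\x86'
print(len(broken))  # 9
```

The 0xE2 lead byte becomes the 7-byte entity, while 0x98 and 0x86 fall in the C1 range, have no entity name, and stay as raw bytes.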

I couldn't configure W3C tidy in a way that doesn't garble this irreversibly. I only found how to make lxml skip it silently. SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.

What else is there?

Upvotes: 4

Views: 2137

Answers (2)

Steven D. Majewski

Reputation: 2167

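Maybe something like this, which turns the Latin-1 entities back into raw bytes before parsing:

import re
from html.entities import name2codepoint  # Python 2: htmlentitydefs
from lxml import etree  # or maybe 'html', if the input is still more broken

def repl_ent(m):
    # Map a Latin-1 entity back to its raw byte; keep XML's own entities and unknown names.
    cp = name2codepoint.get(m.group(1).decode('ascii'))
    if cp is None or cp > 0xFF or m.group(1) in (b'amp', b'lt', b'gt', b'quot'):
        return m.group(0)
    return bytes([cp])

# badxml is the raw response as bytes
goodxml = re.sub(rb'&(\w+);', repl_ent, badxml)
etree.fromstring(goodxml)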

Upvotes: 0

Manoj Govindan

Reputation: 74765

BeautifulSoup is your best bet in this case. I suggest profiling before ruling out BeautifulSoup altogether.
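
A rough way to do that profiling is `timeit`. The sketch below is only a starting point: `time_parser` and the sample document are made up for illustration, and the standard library's ElementTree stands in as the baseline so the snippet runs without extra packages.

```python
import timeit
import xml.etree.ElementTree as ET


def time_parser(parse, data, number=100):
    """Return the average number of seconds one call to parse(data) takes."""
    return timeit.timeit(lambda: parse(data), number=number) / number


# Synthetic, well-formed sample; substitute a real (repaired) response.
sample = b"<root>" + b"<item>text</item>" * 1000 + b"</root>"

baseline = time_parser(ET.fromstring, sample)
print("ElementTree: %.6f s per parse" % baseline)

# Hypothetical comparison, assuming the bs4 package is installed:
# from bs4 import BeautifulSoup
# soup_time = time_parser(lambda d: BeautifulSoup(d, "html.parser"), sample)
```

Run it against documents of the size you actually receive; the difference may or may not matter for your workload.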

Upvotes: 2
