ChaimG
ChaimG

Reputation: 7522

Cleaning up broken XML in Python

A server I don't control sends broken XML with characters such as '>', '&', '<' etc. in attributes and text.

A small sample:

<StockFormula Description="" Name="F_ΔTURN" RankType="Higher" Scope="Universe" Weight="10.86%">
    <Formula>AstTurnTTM>AstTurnPTM</Formula>
</StockFormula>
<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">
</Composite>

I settled on using the lxml module because it is case sensitive, is very fast and gets the job done.

How would I go about fixing up this type of XML? Basically, I am trying to replace all occurrences of the invalid characters with the proper escape sequences.

import re

broken = '<StockFormula Description="" Name="F_ΔTURN" RankType="Higher" Scope="Universe" Weight="10.86%">\n<Formula>AstTurnTTM>AstTurnPTM</Formula>\n<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">\n</Composite>'
print re.sub(r'(.*Name=".*)&(")', r'\g<1>&gt;\g<2>', broken)

Output:

<StockFormula Description="" Name="F_ÃŽâ€TURN" RankType="Higher" Scope="Universe" Weight="10.86%">
    <Formula>AstTurnTTM>AstTurnPTM</Formula>
</StockFormula>
<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">
</Composite>

Upvotes: 1

Views: 995

Answers (1)

kjhughes
kjhughes

Reputation: 111686

First, realize no XML parser can help you with "broken XML." XML parsers only operate over XML, which by definition must be well-formed.

Second, it is not possible to repair "broken XML" in the general case. There are no rules governing "broken XML." Without a clear definition of "broken XML," you cannot be guaranteed to be able to process it and convert it to real XML.

That said, HTML Tidy does a decent job of repairing (X)HTML, and it has limited capabilities for repairing XML as well. It's your best bet for automated repair of "broken XML." There is a Python package, PyTidyLib, which wraps the HTML Tidy library.

Upvotes: 3

Related Questions