ottrin

Reputation: 35

Python - reading a non-well-formed XML file

How can I read an XML file whose name attribute contains forbidden characters such as <, >, " and '? The file has more than 30k rows, and the target is a pandas DataFrame.

<rows>
<row number="164" item="9860404" name="160-30 Bracket" qty="1"/>
<row number="164" item="9860405" name="200-30 <> Bracket" qty="1" />
<row number="164" item="9860406" name="250-30 3/4" Bracket" qty="3" />
<row number="164" item="9860407" name="315-30 <-> Bracket" qty="4"/>
</rows>

Upvotes: 2

Views: 634

Answers (2)

Michael Kay

Reputation: 163322

It's not an XML file so you can't read it using XML tools; you need non-XML tools. You'll only confuse people if you call it an XML file; much better to tell everyone you have a non-XML file.

There are some tools designed to repair bad XML, but bad XML comes in many shapes and sizes and it's never possible to produce exactly what you would want in every case.

Whoever generated this file has completely missed the point that using standard data formats is supposed to save everyone time and money. If the data still exists in some other form, then you should try and regenerate the XML and get it right.

If the errors are confined to incorrect use of < within attribute values, then you can probably repair it using a regex-based tool (e.g. awk, Perl, or just a text editor). If it uses quotation marks within attribute values, then you're probably hosed: there's no way of distinguishing the quotes that were intended to be attribute delimiters from those that weren't.
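
As a rough sketch of the regex-based repair for the simpler case, the following Python escapes < and > inside double-quoted attribute values and then hands the result to a regular XML parser. It assumes the values contain no stray quotation marks and no bare ampersands; the helper name escape_angle_brackets and the inline sample string are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches one double-quoted attribute value, e.g. ="200-30 <> Bracket".
# Assumption: values never contain a stray double quote.
ATTR_VALUE = re.compile(r'="([^"]*)"')

def escape_angle_brackets(text):
    """Escape < and > that appear inside double-quoted attribute values."""
    def fix(match):
        value = match.group(1).replace("<", "&lt;").replace(">", "&gt;")
        return '="' + value + '"'
    return ATTR_VALUE.sub(fix, text)

# Hypothetical sample: only the rows whose sole defect is < or > in a value.
raw = '''<rows>
<row number="164" item="9860405" name="200-30 <> Bracket" qty="1" />
<row number="164" item="9860407" name="315-30 <-> Bracket" qty="4"/>
</rows>'''

# After the repair the text is well-formed, so a standard parser accepts it.
root = ET.fromstring(escape_angle_brackets(raw))
names = [row.get("name") for row in root.iter("row")]
```

A row such as name="250-30 3/4" Bracket" would still break this, for the reason given above.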

Upvotes: 1

larsks

Reputation: 311606

You can parse your example data using the HTMLParser from lxml.etree:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> doc = etree.parse('data.xml', parser=parser)
>>> [elem.get('name') for elem in doc.xpath('//row')]
['160-30 Bracket', '200-30 <> Bracket', '250-30 3/4', '315-30 <-> Bracket']

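Note that parsing the data with the HTML parser wraps your document in <html> and <body> elements. Also note that the name in the third row is truncated at the stray quotation mark, and the rest of the value ends up as a separate bracket attribute. The document structure ends up looking like: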

<html><body><rows>
<row number="164" item="9860404" name="160-30 Bracket" qty="1"/>
<row number="164" item="9860405" name="200-30 &lt;&gt; Bracket" qty="1"/>
<row number="164" item="9860406" name="250-30 3/4" bracket="" qty="3"/>
<row number="164" item="9860407" name="315-30 &lt;-&gt; Bracket" qty="4"/>
</rows>
</body></html>
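
Since the target is a pandas DataFrame, one possible way to carry the parsed rows over is sketched below. It assumes lxml and pandas are installed and uses an inline sample string instead of a file; rows_to_frame is a hypothetical helper name. Values truncated by a stray quote, as in the third row above, stay truncated.

```python
import pandas as pd
from lxml import etree

def rows_to_frame(text):
    """Parse <row> elements with lxml's lenient HTML parser into a DataFrame."""
    root = etree.fromstring(text, parser=etree.HTMLParser())
    # Each row's attributes become one record; all values arrive as strings.
    records = [dict(row.attrib) for row in root.xpath('//row')]
    return pd.DataFrame(records)

# Hypothetical sample limited to rows with < and > in the name attribute.
sample = '''<rows>
<row number="164" item="9860405" name="200-30 <> Bracket" qty="1" />
<row number="164" item="9860407" name="315-30 <-> Bracket" qty="4"/>
</rows>'''

df = rows_to_frame(sample)
df["qty"] = df["qty"].astype(int)  # numeric columns need explicit conversion
```

For a real file, etree.parse('data.xml', parser=etree.HTMLParser()) followed by the same xpath call should give the same records.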

Upvotes: 2
