amphibient

Reputation: 31230

Parsing XML with special chars using ElementTree

I am trying to parse the response from a GET service with ElementTree. I don't control its content, and it contains a non-UTF-8 special character:

respXML = response.content.decode("utf-8")

respRoot = ET.fromstring(respXML)

The second line throws

xml.etree.ElementTree.ParseError: reference to invalid character number: line 3591, column 39

How can I make sure the XML gets parsed regardless of the character set, so that I can run a replacement afterwards if I find illegal characters? For example, is there an encoding that includes everything? I understand I can do a search and replace on the input XML string, but I would prefer to parse it first, because parsing converts it into a data structure that is more easily searchable.

The special character in question is &#25; but I would like to be able to ingest any character. The whole tag is <literal>Alzheimer&#25;s disease</literal>.

Upvotes: 3

Views: 4614

Answers (1)

amphibient

Reputation: 31230

With a little help from @tdelaney, I was able to get past this hurdle by scrubbing the input XML as a string:

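import re
import xml.etree.ElementTree as ET

respXML = response.content.decode("utf-8")
# Match only numeric character references such as &#25; (a greedy '&.+' can swallow the text between two entities)
scrubbedXML = re.sub(r'&#[0-9]+;', '', respXML)
respRoot = ET.fromstring(scrubbedXML)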
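
One caveat with scrubbing by regex is that a broad pattern can also delete references the parser would accept, such as &#233;. A minimal sketch of a narrower scrub follows, assuming the only problem is numeric character references that point outside the character ranges XML 1.0 allows. The names scrub_invalid_char_refs and _is_valid_xml_char are hypothetical helpers, not part of ElementTree, and replacing &#25; with an apostrophe is only a guess at what the service meant to send.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#25;) and hexadecimal (&#x19;) character references.
_CHAR_REF = re.compile(r'&#(?:x([0-9A-Fa-f]+)|([0-9]+));')


def _is_valid_xml_char(cp):
    # Character ranges allowed by the XML 1.0 specification.
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def scrub_invalid_char_refs(xml_text, replacement=''):
    # Hypothetical helper: keep valid references, swap out invalid ones.
    def _fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _is_valid_xml_char(cp) else replacement

    return _CHAR_REF.sub(_fix, xml_text)


sample = '<root><literal>Alzheimer&#25;s disease</literal><ok>caf&#233;</ok></root>'
root = ET.fromstring(scrub_invalid_char_refs(sample, "'"))
print(root.find('literal').text)  # Alzheimer's disease
print(root.find('ok').text)       # café
```

Passing a visible placeholder as replacement instead of an empty string also makes it possible to search the parsed tree afterwards for the places where something was scrubbed.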

Upvotes: 1
