Reputation: 31230
The response from a GET service that I am trying to parse with ElementTree, and whose content I don't control, contains a non-UTF8 special character:
respXML = response.content.decode("utf-8")
respRoot = ET.fromstring(respXML)
The second line throws
xml.etree.ElementTree.ParseError: reference to invalid character number: line 3591, column 39
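For context, this message usually points at a numeric character reference (something of the form `&#NN;`) whose code point XML 1.0 does not allow, rather than at the byte encoding, since the `decode("utf-8")` call itself succeeded. A minimal reproduction sketch follows. The `&#25;` reference in it is an assumption standing in for the real offending text, which is not shown in this post:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: &#25; stands in for the real offending reference,
# which is not shown in the post.
sample = "<literal>Alzheimer&#25;s disease</literal>"

try:
    ET.fromstring(sample)
    error_message = None
except ET.ParseError as exc:
    # The parser rejects the reference before any element tree is built.
    error_message = str(exc)

print(error_message)
```

If this reproduces the same error text, the problem is independent of the character set, so choosing a different encoding would not be expected to help.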
How can I make sure that the XML gets parsed regardless of the character set, which I can later run a replacement against if I find illegal characters? For example, is there an encoding which includes everything? I understand I can do a search and replace of the input XML string but I would prefer to parse it first because my parsing converts it into a data structure which is more easily searchable.
The special character in question is 
but I would like to be able to ingest any character. The whole tag is <literal>Alzheimers disease</literal>.
Upvotes: 3
Views: 4614
Reputation: 31230
With a little help from @tdelaney, I was able to get past this hurdle by scrubbing the input XML as a string:
respXML = response.content.decode("utf-8")
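# Strip numeric character references such as &#25; or &#x19; (this removes valid ones too)
scrubbedXML = re.sub(r'&#(?:x[0-9A-Fa-f]+|[0-9]+);', '', respXML)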
respRoot = ET.fromstring(scrubbedXML)
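A narrower variant of the same scrubbing idea is sketched below: it drops only those numeric references whose code point falls outside the XML 1.0 `Char` range and leaves valid references and named entities alone. The helper names `scrub_invalid_char_refs` and `_is_xml_char` are hypothetical, and the `&#25;` in the sample is an assumed stand-in for the real offending reference.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#25;) and hexadecimal (&#x19;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def scrub_invalid_char_refs(xml_text: str, replacement: str = "") -> str:
    """Replace only the references that XML 1.0 forbids; keep valid ones."""
    def _fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _is_xml_char(cp) else replacement

    return _CHAR_REF.sub(_fix, xml_text)


# Assumed sample input: one forbidden reference, one valid reference,
# and one named entity.
sample = (
    "<r><literal>Alzheimer&#25;s disease</literal>"
    "<ok>caf&#233; &amp; bar</ok></r>"
)
root = ET.fromstring(scrub_invalid_char_refs(sample))
print(root.find("literal").text)
print(root.find("ok").text)
```

Passing a visible `replacement` such as `"?"` instead of the empty string may make it easier to locate the affected values after parsing.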
Upvotes: 1