Reputation: 439
I am using the Python module lxml to parse XML files. However, some of the XML files contain invalid characters such as ®. Because of this, I get the following error:
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xAE 0x0A 0x53 0x6F, line 45, column 91
-> Removing the character solves the problem.
I cannot ask the data provider to supply XML without such characters. To avoid posting a duplicate, I tried the following solution from Stack Overflow, but it gave me the same error:
parsed_doc = etree.parse(u, etree.XMLParser(encoding='utf-8', ns_clean=True, recover=True))
How do I ignore/escape such characters?
Upvotes: 2
Views: 5592
Reputation: 439
As mentioned by @jwodder, the XML file was not actually encoded in UTF-8, even though its encoding attribute declared UTF-8. I changed the encoding parameter of the lxml parser to ISO-8859-1:
parsed_doc = etree.parse(u, etree.XMLParser(encoding='ISO-8859-1', ns_clean=True, recover=True))
It worked perfectly.
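For anyone hitting the same error, here is a minimal sketch of why the declared encoding was wrong. It uses only the standard library and the four bytes quoted in the error message; the helper name try_decode is made up for illustration and is not part of lxml.

```python
# The four bytes reported in the lxml error: 0xAE 0x0A 0x53 0x6F
raw = bytes([0xAE, 0x0A, 0x53, 0x6F])


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if the
    bytes are not valid in the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xAE on its own is a UTF-8 continuation byte, so it cannot start a character.
as_utf8 = try_decode(raw, "utf-8")

# In ISO-8859-1 every byte maps to one character, and 0xAE is the ® sign.
as_latin1 = try_decode(raw, "iso-8859-1")

print(repr(as_utf8))    # None
print(repr(as_latin1))  # '®\nSo'
```

Note that ISO-8859-1 can decode any byte sequence, so a successful decode does not prove the file really uses that encoding (windows-1252, for example, also maps 0xAE to ®). It is worth checking that the resulting text looks right.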
Upvotes: 2