paris_serviola
paris_serviola

Reputation: 439

How to make xml parser ignore invalid characters?

I am using python module lxml to parse xml files. However, some of the xml files contain invalid characters such as ® . Due to this, I am getting following error.

lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !

Bytes: 0xAE 0x0A 0x53 0x6F, line 45, column 91

-> Removing the character solves the problem.

I cannot tell the data provider to provide me xml without such character. To avoid duplication, I have tried following solution from stack overflow and it gave me same error.

parsed_doc = etree.parse(u, etree.XMLParser(encoding='utf-8', ns_clean=True, recover=True))

How do I ignore/escape such characters?

Upvotes: 2

Views: 5592

Answers (1)

paris_serviola
paris_serviola

Reputation: 439

As mentioned by @jwodder, the xml file was not encoded with utf-8 encoding even though it had utf-8 as encoding attribute. . I changed my encoding params to ISO-8859-1 in lxml parser.

parsed_doc = etree.parse(u, etree.XMLParser(encoding='ISO-8859-1', ns_clean=True, recover=True))

It worked perfectly.

Upvotes: 2

Related Questions