nishant kumar

Reputation: 517

cElementTree.ParseError: not well-formed (invalid token)

I have a large XML file (657 MB, holding details of 2 million objects) whose contents look like the sample below.

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <item>
        <rank>1</rank>
        <landinglink>www.google.com</landinglink>
        <descrip>some text</descrip>
    </item>
    <item>
        <rank>1</rank>
        <landinglink>www.facebook.com</landinglink>
        <descrip>some text</descrip>
    </item>
    <item>
        <rank>1</rank>
        <landinglink>www.xyz.com</landinglink>
        <descrip>some text</descrip>
    </item>
    .
    .
    .
    .
    .
    .
    .
</root>

I am trying to print every 'landinglink' value. The code I am using is shown below.

import xml.etree.cElementTree as ET
for event, elem in ET.iterparse("filename.xml"):
    if event == 'end' and elem.tag == 'item':
        print elem.find('landinglink').text

But when I run the code, it gives me the following error.

Traceback (most recent call last):
  File "D:/test.py", line 2, in <module>
    for event, elem in ET.iterparse("filename.xml"):
  File "<string>", line 91, in next
cElementTree.ParseError: not well-formed (invalid token): line 1338, column 298

The error keeps recurring at different locations. How can I avoid this type of error? Any help will be highly appreciated.

Upvotes: 0

Views: 2614

Answers (1)

cco

Reputation: 6281

(posting as an answer for later readers)

If the offending byte is \xA0, then the file isn't properly encoded as UTF-8: a lone \xA0 byte is not a valid UTF-8 sequence (in ISO-8859-1 it is a non-breaking space).
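
To check which byte is actually at fault, one option is to scan the file line by line and try to decode each line as UTF-8. The following is only a rough sketch, written for Python 3 rather than the Python 2 used in the question; the helper name find_invalid_utf8 and the in-memory sample are made up for illustration, and with a real file you would pass open("filename.xml", "rb") instead.

```python
import io


def find_invalid_utf8(stream):
    """Return (line number, byte offset in line, offending byte) for the
    first byte that is not valid UTF-8, or None if every line decodes."""
    for line_no, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte in this line
            return line_no, exc.start, raw[exc.start:exc.start + 1]
    return None


# Hypothetical sample standing in for the real file: a lone 0xA0 byte on line 2.
sample = io.BytesIO(b"<root>\n<descrip>some\xa0text</descrip>\n</root>\n")
result = find_invalid_utf8(sample)
print(result)
```

Reading in binary mode and iterating by line keeps memory use low, which matters for a file of several hundred megabytes.
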
If the file contains only single-byte (8-bit) characters, change the XML declaration to name the encoding actually used, probably <?xml version="1.0" encoding="iso-8859-1" ?>.
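
The effect of the declaration can be seen on a tiny document. This is a minimal sketch in Python 3 (the question uses Python 2); the sample bytes are invented for illustration and assume the stray byte really is ISO-8859-1 data, which may not hold for the original file.

```python
import xml.etree.ElementTree as ET

# Same body bytes in both documents; only the declared encoding differs.
body = b"<root><item><descrip>some\xa0text</descrip></item></root>"
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?>' + body
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

# Declared as UTF-8, the lone 0xA0 byte makes the parser reject the document.
try:
    ET.fromstring(utf8_doc)
    utf8_ok = True
except ET.ParseError as err:
    utf8_ok = False
    print("UTF-8 declaration fails:", err)

# Declared as ISO-8859-1, the same byte is read as a non-breaking space.
text = ET.fromstring(latin1_doc).find("item/descrip").text
print(repr(text))
```

If the file turns out to mix encodings, changing the declaration alone will not be enough and the bad bytes would have to be repaired at the source.
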

Upvotes: 1
