Reputation: 517
I have a large XML file (details of 2 million objects) with contents similar to those shown below. The file size is 657 MB.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<item>
<rank>1</rank>
<landinglink>www.google.com</landinglink>
<descrip>some text</descrip>
</item>
<item>
<rank>1</rank>
<landinglink>www.facebook.com</landinglink>
<descrip>some text</descrip>
</item>
<item>
<rank>1</rank>
<landinglink>www.xyz.com</landinglink>
<descrip>some text</descrip>
</item>
.
.
.
.
.
.
.
</root>
I am trying to print the text of every 'landinglink' element. The code I am using is shown below.
import xml.etree.cElementTree as ET
for event, elem in ET.iterparse("filename.xml"):
    if event == 'end' and elem.tag == 'item':
        print elem.find('landinglink').text
But when I run the code, it gives me the following error.
Traceback (most recent call last):
  File "D:/test.py", line 2, in <module>
    for event, elem in ET.iterparse("filename.xml"):
  File "<string>", line 91, in next
cElementTree.ParseError: not well-formed (invalid token): line 1338, column 298
The same error keeps occurring at different locations in the file. How can I avoid this type of error? Any help will be highly appreciated.
Upvotes: 0
Views: 2614
Reputation: 6281
(posting as an answer for later readers)
If the bad token value is \xA0, then the file isn't properly encoded as UTF-8: in ISO-8859-1, byte 0xA0 is a non-breaking space, and on its own it is not a valid UTF-8 sequence.
If the file only has 8-bit characters, you need to change the XML declaration to name the encoding actually used, probably <?xml version="1.0" encoding="iso-8859-1" ?>.
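To check this without editing the file, here is a minimal sketch in Python 3 (the question uses Python 2, so adapt as needed). It first looks for the first byte that is not valid UTF-8, then parses the same bytes while overriding the declared encoding through the `encoding` argument of `xml.etree.ElementTree.XMLParser`. The helper names `find_invalid_utf8` and `parse_with_encoding` and the inline `sample` document are made up for illustration, and `iso-8859-1` is only a guess about the real encoding.

```python
import xml.etree.ElementTree as ET


def find_invalid_utf8(data):
    """Return (offset, byte) of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.start + 1]
    return None


def parse_with_encoding(data, encoding):
    """Parse XML bytes, overriding the encoding named in the XML declaration."""
    parser = ET.XMLParser(encoding=encoding)
    parser.feed(data)
    return parser.close()  # root element of the parsed tree


# Hypothetical sample: declared as UTF-8, but contains a raw 0xA0 byte.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<root><item><landinglink>www.xyz.com</landinglink>"
    b"<descrip>some\xa0text</descrip></item></root>"
)

bad = find_invalid_utf8(sample)  # position and value of the offending byte
root = parse_with_encoding(sample, "iso-8859-1")  # assumed real encoding
links = [item.findtext("landinglink") for item in root.iter("item")]
```

This sketch builds the whole tree in memory, so for a 657 MB file it is mainly useful for confirming the diagnosis on a slice of the data. Once the encoding is confirmed, correcting the declaration as described above lets the original iterparse loop run unchanged.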
Upvotes: 1