Reputation: 4062
I'm trying to parse the following XML file using the lxml.etree.iterparse function.
"sampleoutput.xml"
<item>
<title>Item 1</title>
<desc>Description 1</desc>
</item>
<item>
<title>Item 2</title>
<desc>Description 2</desc>
</item>
I tried the code from "Parsing Large XML file with Python lxml and Iterparse".
Before the etree.iterparse(MYFILE) call I set MYFILE = open("/Users/eric/Desktop/wikipedia_map/sampleoutput.xml","r")
But it raises the following error:
Traceback (most recent call last):
File "/Users/eric/Documents/Programming/Eclipse_Workspace/wikipedia_mapper/testscraper.py", line 6, in <module>
for event, elem in context :
File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:98565)
File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:99086)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 5, column 1
Any ideas? Thank you!
Upvotes: 11
Views: 9670
Reputation: 27363
The problem is that XML isn't well-formed unless it has exactly one top-level element. You can fix your sample by wrapping the entire document in <items></items> tags. You also need to rename the <desc> tags to <description> so that they match the query you're using (description).
The following document produces correct results with your existing code:
<items>
<item>
<title>Item 1</title>
<description>Description 1</description>
</item>
<item>
<title>Item 2</title>
<description>Description 2</description>
</item>
</items>
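As a rough check of the wrapped document, here is a minimal sketch that reads it with iterparse. It is not the code from the linked question: the read_items helper and the in-memory WRAPPED bytes are made up for illustration, and the fallback to the standard library's xml.etree.ElementTree is only an assumption so that the snippet still runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the stdlib parser, whose iterparse is similar.
    import xml.etree.ElementTree as etree

# The wrapped document from this answer, held in memory instead of a file.
WRAPPED = b"""<items>
<item>
<title>Item 1</title>
<description>Description 1</description>
</item>
<item>
<title>Item 2</title>
<description>Description 2</description>
</item>
</items>"""


def read_items(source):
    """Collect (title, description) pairs from each <item> element."""
    rows = []
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "item":
            rows.append((elem.findtext("title"), elem.findtext("description")))
            elem.clear()  # drop the item's children once it has been read
    return rows


rows = read_items(io.BytesIO(WRAPPED))
print(rows)
```

With a real file you would pass a file object opened in binary mode ("rb") instead of the BytesIO wrapper.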
Upvotes: 14
Reputation: 9341
As far as I know, xml.etree.ElementTree usually expects the XML file to contain one "root" element, i.e. one XML tag that encloses the complete document structure. From the error message you posted I would assume that this is the problem here as well:
"Line 5" refers to the second <item> tag, so I guess the parser complains that there is more data following after the assumed root element (i.e. the first <item> element) was closed.
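To illustrate the guess above, here is a small sketch using the standard library's xml.etree.ElementTree rather than lxml. It feeds the question's sample to iterparse and reports the line where parsing stops, then tries the same data wrapped in a single root element. The first_error_line helper and the <items> wrapper name are hypothetical, chosen only for this example.

```python
import io
import xml.etree.ElementTree as ET

# The sample from the question: two <item> elements and no enclosing root.
FRAGMENT = b"""<item>
<title>Item 1</title>
<desc>Description 1</desc>
</item>
<item>
<title>Item 2</title>
<desc>Description 2</desc>
</item>
"""


def first_error_line(data):
    """Return the line where parsing fails, or None if the data parses cleanly."""
    try:
        for _event, _elem in ET.iterparse(io.BytesIO(data)):
            pass  # only consume the events; nothing is collected here
    except ET.ParseError as err:
        return err.position[0]  # position is a (line, column) tuple
    return None


error_line = first_error_line(FRAGMENT)
# Wrapping the same bytes in one root element makes the document well-formed.
wrapped_error_line = first_error_line(b"<items>" + FRAGMENT + b"</items>")
print(error_line, wrapped_error_line)
```

Concatenating bytes like this only suits small inputs; for a large file it is probably simpler to add the root element to the file itself.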
Upvotes: 5