Felixasdf

Reputation: 13

Iterparsing a HUGE XML file using Python but getting an error

I'm trying to parse a huge XML file using Python, but I'm getting this error:

    File "parser.py", line 6, in <module>
        event, root = text.next()
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1281, in next
        self._root = self._parser.close()
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1654, in close
        self._raiseerror(v)
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
        raise err
    xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

My code right now looks like this:

    import xml.etree.ElementTree as ET
    from StringIO import StringIO

    text = ET.iterparse(StringIO('Posts.xml'), events=('start', 'end', 'start-ns', 'end-ns'))
    text = iter(text)
    event, root = text.next()

    for event, elem in text:
        currId = elem.get('PostTypeId')
        if (currId != '1'):
            root.remove(elem)

    tree.write('cut.xml')

The XML file I'm trying to parse looks something like this:

    <posts>

     <row FavoriteCount="4" CommentCount="4" AnswerCount="7" Tags="<discussion><answers>" Title="Why would anyone accept an answer?" LastActivityDate="2014-04-23T09:14:37.103" LastEditDate="2010-09-03T00:42:07.733" LastEditorUserId="99" OwnerUserId="4" Body="<p>I'm looking at the questions proposed during the Area 51 process:</p> <ul> <li>My supervisor thinks that all <code>If</code> statements should include <code>else</code> statements. Do you agree?</li> <li>What are common mistakes in Software Development?</li> <li>Tabs vs. Spaces: What is the one proper indentation character for everything, in every situation, ever?</li> <li>What programming language should I teach to my 4 year old son?</li> <li>What was the turning point of your programming career?</li> </ul> <p>None of these have an answer that should be accepted. The questions are interesting, and the answers would also be informative if the answer was well written and explained why the answerer thinks his method or idea is better. But I can't really see being able to accept an answer to any of these questions.</p> <p>So, if I ask a question, how do I decide if or how to accept an answer? There is no right or wrong answer and just because it works for me doesn't mean I should be floating that answer to the top - unless I'm overlooking something, the questions that are on topic here are very subjective. On Stack Overflow, there are often multiple right solutions to a problem. Here, we have a problem with an infinite number of solutions, none of which are arguably better or worse than any others.</p> <p>Thoughts?</p> " ViewCount="1582" Score="30" CreationDate="2010-09-01T19:32:45.710" PostTypeId="1" Id="1"/>

    <row CommentCount="0" AnswerCount="4" Tags="<discussion><site-attributes><faq-contents><top-7>" Title="What should our FAQ contain?" LastActivityDate="2015-03-18T19:19:24.887" LastEditDate="2015-03-18T19:19:24.887" LastEditorUserId="25936" OwnerUserId="9" Body="<p>One of the big 7 questions.</p> " ViewCount="318" Score="6" CreationDate="2010-09-01T19:34:51.797" PostTypeId="1" Id="2" CommunityOwnedDate="2010-09-02T03:42:26.083"/>

     <row FavoriteCount="8" CommentCount="8" AnswerCount="32" Tags="<discussion><top-7><site-attributes>" Title="What should our domain name be?" LastActivityDate="2014-04-23T09:14:37.103" LastEditDate="2010-12-20T02:46:31.950" LastEditorUserId="2314" OwnerUserId="9" Body="<blockquote> <p><strong>Possible Duplicate:</strong><br> <a href="http://meta.programmers.stackexchange.com/questions/412/write-an-elevator-pitch-tagline">Write an Elevator Pitch / Tagline</a> </p> </blockquote> <h2>Note:</h2> <p>We are closing this domain naming thread. It is asking the <em>entirely</em> wrong question. See this blog post for details: <a href="http://blog.stackoverflow.com/2010/10/domain-names-the-wrong-question/" rel="nofollow">Domain Names: Wrong Question</a> </p> <p>We're going to keep the name programmers.stackexchange.com. But we WILL be setting up redirects from the more "popular" domains names. (e.g. seasonedadvice.com to cooking.stackexchange.com, basicallymoney.com to money.stackexchange.com, and others as we go through the list).</p> <p>New question: "<strong>Write an Elevator Pitch / Tagline!</strong>"</p> <p><a href="http://meta.programmers.stackexchange.com/questions/412/write-an-elevator-pitch-tagline"><strong>Click here to contribute ideas and vote.</strong></a> </p> <p><em>[original message text below]</em></p> <p>One of the big 7 questions.</p> <ul> <li>One answer per answer please</li> <li>Only .com domain names please</li> <li>Only untaken domain names please (use whois)</li> </ul> <p>Please use <strong>lowercase characters only</strong> in domain name!<br> DomainName.com is more readable, but we have to register domainname.com!</p> " ViewCount="1146" Score="16" CreationDate="2010-09-01T19:36:08.390" PostTypeId="1" Id="3" CommunityOwnedDate="2010-09-02T03:40:00.467" ClosedDate="2010-10-08T21:02:50.313"/>
    ...

    </posts>

Upvotes: 1

Views: 332

Answers (3)

John Gordon

Reputation: 33351

You're not reading the file correctly.

StringIO('Posts.xml') does NOT read the file; it creates a file-like object with the contents "Posts.xml".

That's why iterparse is complaining; the content does not start with <.
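A minimal sketch of what is happening, written for Python 3 (the question uses Python 2.7, where the class lives in the `StringIO` module rather than `io`). The variable names `content` and `error` are only for illustration:

```python
import io
import xml.etree.ElementTree as ET

# StringIO wraps the literal text it is given; it never touches the disk.
content = io.StringIO('Posts.xml').read()

# Feeding that same buffer to iterparse means parsing the nine characters
# "Posts.xml" as if they were an XML document, which fails.
error = None
try:
    for _event, _elem in ET.iterparse(io.StringIO('Posts.xml')):
        pass
except ET.ParseError as exc:
    error = exc

print(repr(content))
print(type(error).__name__)
```

The buffer holds the filename itself, so the parser never sees anything from the real file.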

Upvotes: 1

Georg Grab

Reputation: 2301

ElementTree.iterparse expects a source, which can be either a filename or a file object. You're providing it with a string buffer whose content is the text Posts.xml instead of the actual contents of the file Posts.xml, and that text is obviously not valid XML.

So, just get rid of the StringIO call and pass the filename directly; ElementTree will handle opening the file for you. There are, however, some more problems with your input file that prevent it from being parsed correctly (see sverasch's answer).
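As a rough sketch of that idea in Python 3: the helper `filter_questions` and the tiny, well-formed sample file below are hypothetical, made up for illustration. Note that it swaps the question's `root.remove(...)` plus `tree.write(...)` approach for writing each kept row out as it is seen, which is one common way to keep memory use low on a huge file, not necessarily what the asker intended.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def filter_questions(src_path, dst_path):
    """Copy only <row> elements with PostTypeId="1" from src_path to dst_path."""
    # Pass the path itself: iterparse opens the file on its own.
    context = ET.iterparse(src_path, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    kept = 0
    with open(dst_path, 'w', encoding='utf-8') as out:
        out.write('<%s>\n' % root.tag)
        for event, elem in context:
            if event == 'end' and elem.tag == 'row':
                if elem.get('PostTypeId') == '1':
                    out.write(ET.tostring(elem, encoding='unicode'))
                    out.write('\n')
                    kept += 1
                root.clear()  # drop rows that were already processed
        out.write('</%s>\n' % root.tag)
    return kept


# Hypothetical sample input; a real Posts.xml must be well-formed too.
sample = (
    '<posts>\n'
    '  <row Id="1" PostTypeId="1" Title="A question"/>\n'
    '  <row Id="2" PostTypeId="2" Title="An answer"/>\n'
    '  <row Id="3" PostTypeId="1" Title="Another question"/>\n'
    '</posts>\n'
)
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'Posts.xml')
dst = os.path.join(workdir, 'cut.xml')
with open(src, 'w', encoding='utf-8') as f:
    f.write(sample)

kept = filter_questions(src, dst)
kept_ids = [row.get('Id') for row in ET.parse(dst).getroot()]
print(kept, kept_ids)
```

Only the path is handed to `iterparse`, so there is no `StringIO` involved at any point.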

Upvotes: 1

sverasch

Reputation: 110

I ran your sample XML through xmllint ( http://linux.die.net/man/1/xmllint ) and discovered that you have unescaped less-than and greater-than signs inside attribute values.

> <

should be

&gt; &lt; 

When the parser encounters them unescaped, it thinks it has reached a new tag, or the end of a tag, prematurely.
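A small sketch of the difference, assuming Python 3 and using `quoteattr` from the standard `xml.sax.saxutils` module to do the escaping. The one-attribute `<row>` strings are made up for illustration and are not the real dump format. Strictly speaking, XML only forbids a raw `<` (and `&`) inside an attribute value, but escaping `>` as well is harmless and is what `quoteattr` does:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

raw_tags = '<discussion><answers>'

# Unescaped: the "<" inside the attribute value makes the document ill-formed.
broken = '<row Tags="%s"/>' % raw_tags
try:
    ET.fromstring(broken)
    broken_parses = True
except ET.ParseError:
    broken_parses = False

# Escaped: quoteattr adds the surrounding quotes and escapes &, < and >.
fixed = '<row Tags=%s/>' % quoteattr(raw_tags)
parsed_tags = ET.fromstring(fixed).get('Tags')

print(broken_parses)
print(fixed)
print(parsed_tags)
```

After parsing, the attribute value comes back in its original, unescaped form, so escaping only changes how the text is stored in the file.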

Upvotes: 1
