Python - Request GZ file and Parsing XML

Question

I started learning Python a few days ago so that I could build a basic site that compiles statistics from BOINC projects, e.g. SETI@home.

Basically, the site does the following:

Download the .gz files
Uncompress the .gz files into XML files
Build the XML info into data structures
Write the data structures back out to CSV files
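The four steps above can be sketched roughly as follows. This is a minimal outline using only the standard library, not the actual site code: the function names are hypothetical, and it assumes each record in the stats file is wrapped in a `<user>` element whose children hold the individual fields.

```python
import csv
import gzip
import shutil
import urllib.request
import xml.etree.ElementTree as ET


def download(url, gz_path):
    """Step 1: save the remote .gz file to disk unchanged."""
    with urllib.request.urlopen(url) as resp, open(gz_path, "wb") as out:
        shutil.copyfileobj(resp, out)


def decompress(gz_path, xml_path):
    """Step 2: gunzip the download into a plain XML file."""
    with gzip.open(gz_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def load_users(xml_path):
    """Step 3: turn each <user> element (assumed layout) into a dict."""
    users = []
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "user":
            users.append({child.tag: child.text for child in elem})
            elem.clear()  # free the element once it has been copied
    return users


def write_csv(users, csv_path, fields):
    """Step 4: write the selected fields of every user as one CSV row."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(users)
```

Keeping each step in its own function makes it easier to see which stage fails for a particular project file.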

In total there are 34 .gz files from 34 different BOINC projects.

All the code is now finished and works; however, the .gz file from one project refuses to parse, whereas all the others work fine.

The file is:

user.gz

from

http://www.rnaworld.de/rnaworld/stats/

These are the errors that I am getting:

Traceback (most recent call last):
  File "C:/Users/chris/PycharmProjects/testproject1/rnaw100.py", line 77, in <module>
    for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1227, in iterator
    yield from pullparser.read_events()
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1302, in read_events
    raise event
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1274, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
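Since the error points at line 1, column 0, the parser is rejecting the very first byte of the file, so one thing worth checking is what the decompressed file actually starts with. Below is a minimal sketch of such a check; `describe_start` is a hypothetical helper that is not part of the script in this post, and it only distinguishes the gzip magic bytes from something that looks like XML.

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes
UTF8_BOM = b"\xef\xbb\xbf"


def describe_start(path, n=16):
    """Classify a file by its first bytes: 'gzip', 'xml' or 'unknown'."""
    with open(path, "rb") as f:
        head = f.read(n)
    if head.startswith(GZIP_MAGIC):
        return "gzip"  # the file is (still) compressed
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]  # ignore a UTF-8 byte order mark
    if head.lstrip().startswith(b"<"):
        return "xml"
    return "unknown"
```

If this returned "gzip" for the file that is supposed to be XML, the data would still be compressed at the point where it is handed to the parser.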

As a newbie I am finding it difficult to understand what is wrong, because (a) the errors refer to a Python core file, i.e. ElementTree.py, (b) I can't understand why a .gz file that many other BOINC stats sites are using won't work here, and (c) I can't see why my code works on all the other files but not on this one.

This is the code that downloads the .gz file and parses the XML (I have left out variable declarations etc.):

import gzip
import shutil
import xml.etree.ElementTree as ET

import requests

# Download the .gz file and save it to disk
response = requests.get(url2, stream=True)

if response.status_code == 200:
    with open(target_path2, 'wb') as f:
        f.write(response.raw.read())

# Decompress the .gz file into an XML file
with gzip.open(target_path2, 'rb') as f_in:
    with open(x_file_name2, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Stream through the XML and collect the users that belong to TEAMID
for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):

    if elem.tag == "total_credit" and event == "end":
        tc = float(elem.text)
        elem.clear()

    if elem.tag == "expavg_credit" and event == "end":
        ac = float(elem.text)
        elem.clear()

    if elem.tag == "id" and event == "end":
        id = elem.text
        elem.clear()

    if elem.tag == "cpid" and event == "end":
        cpid = elem.text
        elem.clear()

    if elem.tag == "name" and event == "end":
        name = elem.text
        elem.clear()
    teamid = TEAMID

    # teamid is read after the other fields, so the values above are set
    if elem.tag == "teamid" and event == "end":
        if elem.text == TEAMID:
            cnt = cnt + 1
            dic[id] = {"Name": name, "CPID": cpid, "TC": tc, "AC": ac}
        elem.clear()
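For comparison, the same filtering could be done per user record and straight from the compressed file, without writing an intermediate XML file. This is only a sketch under assumptions: `team_members` is a hypothetical function, the fields are assumed to sit inside a `<user>` element, and the team id is compared as a string, as in the loop above.

```python
import gzip
import xml.etree.ElementTree as ET


def team_members(gz_path, team_id):
    """Return {user id: stats} for users whose <teamid> equals team_id."""
    members = {}
    with gzip.open(gz_path, "rb") as stream:
        # iterparse accepts an open binary file object as well as a filename
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag != "user":
                continue  # wait until a whole user record has been read
            if elem.findtext("teamid") == team_id:
                members[elem.findtext("id")] = {
                    "Name": elem.findtext("name"),
                    "CPID": elem.findtext("cpid"),
                    # a missing or empty credit field is treated as 0
                    "TC": float(elem.findtext("total_credit") or 0),
                    "AC": float(elem.findtext("expavg_credit") or 0),
                }
            elem.clear()  # release the record before reading the next one
    return members
```

Because the lookup happens once per complete `<user>` element, this version does not depend on the order in which the fields appear inside a record.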
