Chris

Reputation: 91

Python - Request GZ file and Parsing XML

I started learning Python a few days ago to build a basic site that compiles statistics from BOINC projects, e.g. SETI@home.

Basically the site does:

In total there are 34 .gz files from 34 different BOINC projects.

All the code is now finished and works; however, the .gz file from one project refuses to parse, whereas the others work fine.

The file is user.gz from http://www.rnaworld.de/rnaworld/stats/

These are the errors that I am getting:

Traceback (most recent call last):
  File "C:/Users/chris/PycharmProjects/testproject1/rnaw100.py", line 77, in <module>
    for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1227, in iterator
    yield from pullparser.read_events()
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1302, in read_events
    raise event
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1274, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
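The "line 1, column 0" in the last line means the parser rejected the very first byte, so the file handed to ET.iterparse probably does not start with XML at all. A minimal sketch for checking that, assuming the extracted file might still be compressed or might be an HTML error page (both are guesses, not confirmed for this server); sniff_format is a hypothetical helper name:

```python
GZIP_MAGIC = b"\x1f\x8b"      # every gzip stream starts with these two bytes
UTF8_BOM = b"\xef\xbb\xbf"    # optional byte-order mark before XML text


def sniff_format(data: bytes) -> str:
    """Guess what a file contains from its first bytes."""
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    head = data[:64]
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    head = head.lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    if head.startswith(b"<"):
        return "xml"
    return "unknown"
```

Running it on the first bytes of the extracted file, for example sniff_format(open(x_file_name2, "rb").read(64)), would show whether the parser is really being given XML.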

This is the code that downloads the .gz file and parses the XML (I have left out variable declarations etc.):

As a newbie I am finding it difficult to understand what is wrong, because (a) the errors refer to a Python core file, e.g. ElementTree.py, (b) I can't understand why a .gz file that many other BOINC stat sites are using won't work here, and (c) I can't see why my code works on the other files but not this one.

import gzip
import shutil
import xml.etree.ElementTree as ET

import requests

response = requests.get(url2, stream=True)

if response.status_code == 200:
    with open(target_path2, 'wb') as f:
        f.write(response.raw.read())

with gzip.open(target_path2, 'rb') as f_in:
    with open(x_file_name2, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):

    # Note: clear() must be called; a bare "elem.clear" does nothing.
    if elem.tag == "total_credit" and event == "end":
        tc=float(elem.text)
        elem.clear()

    if elem.tag == "expavg_credit" and event == "end":
        ac=float(elem.text)
        elem.clear()

    if elem.tag == "id" and event == "end":
        id=elem.text
        elem.clear()

    if elem.tag == "cpid" and event == "end":
        cpid=elem.text
        elem.clear()

    if elem.tag == "name" and event == "end":
        name = elem.text
        elem.clear()
    teamid=TEAMID

    if elem.tag == "teamid" and event == "end":
        if elem.text == TEAMID:
            cnt=cnt+1
            dic[id]={"Name":name,"CPID":cpid, "TC":tc, "AC":ac}
        elem.clear()
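The loop above can also read straight from the compressed file, since ET.iterparse accepts an open file object as well as a filename, which avoids the intermediate .xml file. A sketch, assuming each record is a user element whose children are the tags used above; iter_users is a hypothetical helper name:

```python
import gzip
import xml.etree.ElementTree as ET


def iter_users(gz_path):
    """Yield one dict per <user> element, reading directly from the .gz file."""
    with gzip.open(gz_path, "rb") as f_in:
        # Only "end" events are needed: a record is complete when </user> is seen.
        for event, elem in ET.iterparse(f_in, events=("end",)):
            if elem.tag == "user":
                yield {child.tag: child.text for child in elem}
                elem.clear()  # drop the children once the record is consumed
```

Each yielded dict maps tag names to text, so the team filter becomes something like: if user.get("teamid") == TEAMID. Note that this only helps if the .gz file really contains XML after one decompression.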

Upvotes: 0

Views: 1672

Answers (1)

dabingsou

Reputation: 2469

Another solution.

from simplified_scrapy import SimplifiedDoc,req,utils
import gzip
with gzip.open('user.gz', 'rb') as f_in:
  with open('user.xml', 'wb') as f_out:
    f_out.write(f_in.read())
html = utils.getFileContent('user.xml')
doc = SimplifiedDoc(html)
users = doc.selects('user')
for user in users:
  tags = user.children

@Chris I decompressed the file and saved it, and the data is correct. Try replacing your shutil call with this:

import gzip
with gzip.open('user.gz', 'rb') as f_in:
    with open('user.xml', 'wb') as f_out:
        f_out.write(f_in.read())
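If the extracted file still turns out to be gzip data (for example because the download was compressed twice, which is only an assumption here and not confirmed for rnaworld.de), a sketch like the following peels off compression layers until the gzip magic bytes are gone. gunzip_fully and max_layers are hypothetical names:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def gunzip_fully(data: bytes, max_layers: int = 5) -> bytes:
    """Decompress repeatedly while the data still looks like gzip."""
    layers = 0
    while data[:2] == GZIP_MAGIC:
        if layers == max_layers:
            raise ValueError("too many nested gzip layers")
        data = gzip.decompress(data)
        layers += 1
    return data
```

Data that is not gzip is returned unchanged, so the function is safe to call on files that were only compressed once and already extracted.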

Upvotes: 0
