user13972138

Why can't I scrape this large XML file using Python?

Does anyone know why this code doesn't do the job? It works perfectly when I scrape smaller files covering a single year, e.g. only 2017, but not with this one. Is the file too big? There is no error of any kind. When I run the same script against one of the smaller files, it takes about 30 seconds to download everything and save it to a database, so I don't think the code itself is wrong. With the large file, all I get is "Process finished with exit code 0" and nothing more.

from bs4 import BeautifulSoup
import urllib.request
from app import db
from models import CveData
from sqlalchemy.exc import IntegrityError


url = "https://cve.mitre.org/data/downloads/allitems.xml"
r = urllib.request.urlopen(url)

xml = BeautifulSoup(r, 'xml')
vuln = xml.findAll('Vulnerability')

for element in vuln:
    notes = element.findAll('Notes')
    title = element.find('CVE').text

    # use a distinct name so the inner loop doesn't shadow the outer `element`
    for note in notes:
        desc = note.find(Type="Description").text
        test_date = note.find(Title="Published")

        # skip notes that have no "Published" entry
        if test_date is not None:
            date = test_date.text
            data = CveData(title,date,desc)
            try:
                db.session.add(data)
                db.session.commit()
                print("adding... " + title)

            # don't stop the stream, ignore the duplicates
            except IntegrityError:
                db.session.rollback()

Upvotes: 0

Views: 156

Answers (1)

Aaron

Reputation: 2093

I downloaded the file that you said didn't work and one of the kind you said did, then ran the same grep on both, with different results:

grep -c "</Vulnerability>" allitems-cvrf-year-2019.xml
21386

grep -c "</Vulnerability>" allitems.xml
0

The program is not stalling while opening the file; it runs to completion. You aren't getting any output because there are no Vulnerability tags in that XML file, so findAll('Vulnerability') returns an empty list and the loop body never runs. (My grep is not strictly accurate, since a closing tag may contain whitespace before the final ">", but I doubt that is the case here.)
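
If you would rather do the same check from Python, here is a minimal sketch that streams through an XML source and counts every element name it sees, so you can tell which tags a file actually uses before writing the scraping loop. The count_tags helper is a hypothetical name of mine, and the sample document is made up purely to show the shape of the result; it says nothing about what allitems.xml contains.

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(source):
    """Count element names in an XML stream without loading it all into memory.

    `source` is a filename or a binary file-like object (hypothetical helper).
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop a "{namespace}" prefix, if any, to keep only the local tag name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        counts[local_name] += 1
        elem.clear()  # drop the element's children and text to limit memory use
    return counts


# Tiny made-up document, only to demonstrate the helper.
sample = io.BytesIO(
    b"<root><item><name>a</name></item><item><name>b</name></item></root>"
)
tag_counts = count_tags(sample)
print(tag_counts.most_common())
print(tag_counts["Vulnerability"])  # a Counter gives 0 for names it never saw
```

To try it on a real file, pass a local path or a file opened in binary mode instead of the sample. If the count for Vulnerability comes back as 0, the loop in your script has nothing to iterate over, which would match what you are seeing.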

Upvotes: 1
