Reputation: 1255
I have built a very simple web crawler to crawl ~100 small JSON files at the URL below. The issue is that the crawler takes more than an hour to complete. I find that hard to understand given how small the JSON files are. Am I doing something fundamentally wrong here?
import json

import requests
from lxml import html

def get_senate_vote(vote):
    # Fetch a single senate vote's data.json and parse it.
    URL = 'https://www.govtrack.us/data/congress/113/votes/2013/s%d/data.json' % vote
    response = requests.get(URL)
    json_data = json.loads(response.text)
    return json_data

def get_all_votes():
    all_senate_votes = []
    # Scrape the directory listing for vote links.
    URL = "http://www.govtrack.us/data/congress/113/votes/2013"
    response = requests.get(URL)
    root = html.fromstring(response.content)
    for a in root.xpath('/html/body/pre/a'):
        link = a.xpath('text()')[0].strip()
        if link[0] == 's':  # senate votes only, e.g. 's100/'
            vote = int(link[1:-1])  # strip leading 's' and trailing '/'
            try:
                vote_json = get_senate_vote(vote)
            except:
                return all_senate_votes
            all_senate_votes.append(vote_json)
    return all_senate_votes

vote_data = get_all_votes()
Upvotes: 2
Views: 986
Reputation: 9826
If you are using Python 3.x and crawling multiple sites, then for even better performance I warmly suggest using the aiohttp module, which implements the asynchronous approach.
For example:
import aiohttp
import asyncio

sites = ['url_1', 'url_2']
results = []

def save_response(result):
    # Callback invoked when a task finishes; collect its result.
    site_content = result.result()
    results.append(site_content)

async def crawl_site(site):
    # Fetch a single site and return its body as text.
    async with aiohttp.ClientSession() as session:
        async with session.get(site) as resp:
            return await resp.text()

tasks = []
for site in sites:
    task = asyncio.ensure_future(crawl_site(site))
    task.add_done_callback(save_response)
    tasks.append(task)

all_tasks = asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(all_tasks)
loop.close()

print(results)
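Applied to the vote URLs from the question, a minimal sketch might look like the following (assumptions: fetch_vote and fetch_all_votes are hypothetical helpers, the vote numbers 1-100 are just an illustrative range, and error handling for missing votes is left out). Since asyncio.gather already returns the results in order, no callback is needed here:
import asyncio
import json

import aiohttp

async def fetch_vote(session, vote):
    # Hypothetical helper: fetch one senate vote's data.json and parse it.
    url = 'https://www.govtrack.us/data/congress/113/votes/2013/s%d/data.json' % vote
    async with session.get(url) as resp:
        return json.loads(await resp.text())

async def fetch_all_votes(votes):
    # One shared session reuses connections across all requests.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_vote(session, v) for v in votes))

# Illustrative range only; the real vote numbers come from the directory listing.
vote_data = asyncio.get_event_loop().run_until_complete(fetch_all_votes(range(1, 101)))
print(len(vote_data))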
For more reading, see the aiohttp documentation.
Upvotes: 1
Reputation: 2031
Here is a rather simple code sample where I've measured the time taken for each call. On my system it takes about 2 seconds per request on average, and there are 582 pages to visit, so that is around 19 minutes without printing the JSON to the console. In your case, network time plus print time may increase it.
#!/usr/bin/python

import requests
import re
import time

def find_votes():
    # Scrape the directory listing for senate vote links like 's123'.
    r = requests.get("https://www.govtrack.us/data/congress/113/votes/2013/")
    data = r.text
    votes = re.findall(r's\d+', data)
    return votes

def crawl_data(votes):
    print("Total pages: " + str(len(votes)))
    for x in votes:
        url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
        t1 = time.time()
        r = requests.get(url)
        json = r.json()
        print(time.time() - t1)

crawl_data(find_votes())
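If you prefer to keep the code sequential with requests, reusing a single requests.Session lets HTTP keep-alive skip the TCP/TLS handshake after the first request, which should trim some of that per-request time. A minimal sketch of the same loop with a shared session (how much it helps is an assumption and depends on your network):
import re
import time

import requests

# Reuse one connection for all requests instead of opening a new one each time.
session = requests.Session()

def find_votes():
    r = session.get("https://www.govtrack.us/data/congress/113/votes/2013/")
    return re.findall(r's\d+', r.text)

def crawl_data(votes):
    print("Total pages: " + str(len(votes)))
    for x in votes:
        url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
        t1 = time.time()
        data = session.get(url).json()  # connection is kept alive between calls
        print(time.time() - t1)

crawl_data(find_votes())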
Upvotes: 1