user3642173

Reputation: 1255

Simple web crawler very slow

I have built a very simple web crawler to fetch ~100 small JSON files from the URL in the code below. The issue is that the crawler takes more than an hour to complete, which I find hard to understand given how small the JSON files are. Am I doing something fundamentally wrong here?

import json
import requests
from lxml import html

def get_senate_vote(vote):
    URL = 'https://www.govtrack.us/data/congress/113/votes/2013/s%d/data.json' % vote
    response = requests.get(URL)
    json_data = json.loads(response.text)
    return json_data

def get_all_votes():
    all_senate_votes = []
    URL = "http://www.govtrack.us/data/congress/113/votes/2013"    
    response = requests.get(URL)           
    root = html.fromstring(response.content)
    for a in root.xpath('/html/body/pre/a'):
        link = a.xpath('text()')[0].strip()
        if link[0] == 's':
            vote = int(link[1:-1])
            try:
                vote_json = get_senate_vote(vote)
            except:
                return all_senate_votes
            all_senate_votes.append(vote_json)

    return all_senate_votes

vote_data = get_all_votes()

Upvotes: 2

Views: 986

Answers (2)

Yuval Pruss

Reputation: 9826

If you are using Python 3.x and you are crawling multiple sites, then for even better performance I warmly recommend the aiohttp module, which implements asynchronous HTTP requests on top of asyncio. For example:

import aiohttp
import asyncio

sites = ['url_1', 'url_2']
results = []

def save_response(result):
    # Done-callback: collect the body returned by the finished task.
    site_content = result.result()
    results.append(site_content)

async def crawl_site(site):
    # Fetch a single URL and return its body as text.
    async with aiohttp.ClientSession() as session:
        async with session.get(site) as resp:
            return await resp.text()

tasks = []
for site in sites:
    task = asyncio.ensure_future(crawl_site(site))
    task.add_done_callback(save_response)
    tasks.append(task)
all_tasks = asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(all_tasks)
loop.close()

print(results)

See the aiohttp documentation for more reading.
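If it helps, here is a rough sketch of how the same idea could be applied to the URLs from your question. The URL pattern is taken from your code, but the vote range and the concurrency limit of 20 are placeholder assumptions, and I have not run this against govtrack:

import aiohttp
import asyncio

# Assumed URL pattern from the question; the vote numbers below are
# placeholders -- in practice they would come from the directory listing.
VOTE_URL = 'https://www.govtrack.us/data/congress/113/votes/2013/s%d/data.json'
votes = range(1, 101)

async def fetch_vote(session, semaphore, vote):
    # Fetch one vote's JSON; the semaphore caps concurrent requests.
    async with semaphore:
        async with session.get(VOTE_URL % vote) as resp:
            return await resp.json(content_type=None)  # skip strict content-type check

async def fetch_all():
    semaphore = asyncio.Semaphore(20)  # arbitrary politeness limit
    # One shared session reuses connections across all requests.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_vote(session, semaphore, v) for v in votes))

all_senate_votes = asyncio.run(fetch_all())  # Python 3.7+
print(len(all_senate_votes))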

Upvotes: 1

ishaan

Reputation: 2031

Here is a rather simple code sample; I've measured the time taken for each call. On my system it takes about 2 seconds per request on average, and there are 582 pages to visit, so around 19 minutes without printing the JSON to the console. In your case, network time plus print time may increase it.

#!/usr/bin/python

import requests
import re
import time

def find_votes():
    # Scrape the directory listing and pull out the senate vote ids (s1, s2, ...).
    r = requests.get("https://www.govtrack.us/data/congress/113/votes/2013/")
    data = r.text
    votes = re.findall(r's\d+', data)
    return votes

def crawl_data(votes):
    print("Total pages: " + str(len(votes)))
    for x in votes:
        url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
        t1 = time.time()
        r = requests.get(url)
        vote_json = r.json()
        print(time.time() - t1)  # time taken for this request

crawl_data(find_votes())
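A good share of those ~2 seconds per request is usually connection setup (TCP plus TLS handshake) to the same host. As a sketch of one way to cut that down without going asynchronous (I haven't timed this against govtrack, it's just standard requests behaviour), you can reuse a single requests.Session so the connection stays alive between calls:

import requests
import time

def crawl_data_with_session(votes):
    # A Session reuses the underlying connection, so only the first
    # request to the host pays the TCP/TLS handshake cost.
    results = []
    with requests.Session() as session:
        for x in votes:
            url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
            t1 = time.time()
            r = session.get(url)
            results.append(r.json())
            print(time.time() - t1)
    return results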

Upvotes: 1
