Mohammad Abu Musa

Reputation: 1133

Performance Optimization of scraping code

I am studying web scraping for big data, so I wrote the following code to take some information from a local server on our campus. It works fine, but I think the performance is very slow; each record takes 0.91s to get stored in the database. What the code does is open a web page, take some content from it and store it in the database.

My goal is to lower the time elapsed for scraping a record to something near 0.4s (or less, if possible).

#!/usr/bin/env python

import scraperwiki
import requests
import lxml.html

for i in range(1, 150):
    try:
        html = requests.get("http://testserver.dc/" + str(i) + "/").content
        dom = lxml.html.fromstring(html)
        for entry in dom.cssselect('.rTopHeader'):
            name = entry.cssselect('.bold')[0].text_content()

        for entry in dom.cssselect('div#rProfile'):
            city = entry.cssselect('li:nth-child(2) span')[0].text_content()

        for entry in dom.cssselect('div#rProfile'):
            profile_id = entry.cssselect('li:nth-child(3) strong a')[0].get('href')
        profile = {
            'name': name,
            'city': city,
            'profile_id': profile_id,
        }
        unique_keys = [ 'profile_id' ]
        scraperwiki.sql.save(unique_keys, profile)
        print profile_id
    except:
        print 'Error: ' + str(i)

Upvotes: 0

Views: 412

Answers (1)

Jan Vlcinsky

Reputation: 44112

It is very good that you have a clear aim for how far you want to optimize.

Measure the time needed for scraping first

It is likely that the limiting factor is fetching the URLs.

Simplify your code and measure how long the scraping alone takes. If this does not meet your timing criteria (e.g. a single request takes 0.5 seconds), you have to do the scraping in parallel. Search Stack Overflow; there are many such questions and answers using threading, green threads, etc.
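
Here is a minimal sketch of both steps, assuming the same testserver.dc URLs as in your question. It first times a single bare fetch, then fetches all pages with a thread pool (multiprocessing.dummy provides the Pool API backed by threads, which suits I/O-bound work like HTTP requests); the pool size of 10 is just an arbitrary starting point.

import time
from multiprocessing.dummy import Pool  # thread-backed Pool

import requests

URLS = ["http://testserver.dc/%d/" % i for i in range(1, 150)]

# Step 1: time one bare request, with no parsing or database work.
start = time.time()
requests.get(URLS[0])
print 'one request took %.2f s' % (time.time() - start)

# Step 2: if a single request is already too slow, fetch in parallel.
def fetch(url):
    return url, requests.get(url).content

pool = Pool(10)  # 10 concurrent downloads; tune for your server
start = time.time()
pages = pool.map(fetch, URLS)
pool.close()
pool.join()
print 'fetched %d pages in %.2f s' % (len(pages), time.time() - start)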

Further optimization tips

Parsing XML - use an iterative / SAX approach

Your DOM creation can be turned into iterative parsing. It needs less memory and is very often much faster. lxml offers methods like iterparse; see also the related SO answer.
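
A minimal sketch of the pattern, assuming one of the pages from your question: lxml.etree.iterparse with html=True handles elements as they are closed instead of building the full tree first. Note that CSS selectors such as nth-child do not translate directly to this event-driven style, so the class check below only illustrates the idea.

from io import BytesIO

import requests
from lxml import etree

html = requests.get("http://testserver.dc/1/").content

# Handle each element as soon as it is closed, then release it,
# so the whole document never sits in memory as a tree.
for event, element in etree.iterparse(BytesIO(html), events=('end',),
                                      tag='span', html=True):
    if element.get('class') == 'bold':
        print element.text
    element.clear()  # free the element once it has been handled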

Optimize writing into database

Writing many records one by one can be turned into writing them in batches.
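
A minimal sketch, assuming scraperwiki.sql.save accepts a list of dicts for its data argument (check the documentation of your ScraperWiki version); the record values here are placeholders standing in for the scraped fields.

import scraperwiki

profiles = []
for i in range(1, 150):
    # scrape name, city and profile_id as before; placeholders here
    profiles.append({'name': 'name %d' % i,
                     'city': 'city %d' % i,
                     'profile_id': 'id-%d' % i})

# one bulk save instead of 149 single-row saves
scraperwiki.sql.save(['profile_id'], profiles)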

Upvotes: 2
