Mohammad Abu Musa

Reputation: 1133

Performance Optimization of scraping code

I am studying web scraping for big data, so I wrote the following code to take some information from a local server on our campus. It works fine, but I think the performance is very slow; each record takes 0.91s to get stored in the database. What the code does is open a web page, take some content from it and store it in the database.

My goal is to lower the time elapsed for scraping a record to something near 0.4s (or less, if possible).

#!/usr/bin/env python

import scraperwiki
import requests
import lxml.html

for i in range(1, 150):
    try:
        html = requests.get("http://testserver.dc/" + str(i) + "/").content
        dom = lxml.html.fromstring(html)
        for entry in dom.cssselect('.rTopHeader'):
            name = entry.cssselect('.bold')[0].text_content()

        for entry in dom.cssselect('div#rProfile'):
            city = entry.cssselect('li:nth-child(2) span')[0].text_content()

        for entry in dom.cssselect('div#rProfile'):
            profile_id = entry.cssselect('li:nth-child(3) strong a')[0].get('href')
        profile = {
            'name': name,
            'city': city,
            'profile_id': profile_id,
        }
        unique_keys = [ 'profile_id' ]
        scraperwiki.sql.save(unique_keys, profile)
        print profile_id
    except:
        print 'Error: ' + str(i)

Upvotes: 0

Views: 412

Answers (1)

Jan Vlcinsky

Reputation: 44112

It is very good that you have a clear aim for how far you want to optimize.

Measure the time needed for scraping first

It is likely that the limiting factor is fetching the URLs.

Simplify your code and measure how long the scraping alone takes. If this does not meet your timing criteria (e.g. a single request takes 0.5 seconds), you have to do the scraping in parallel. Search Stack Overflow; there are many such questions and answers using threading, green threads, etc.
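
Here is a minimal sketch of both steps, assuming the same testserver.dc URLs as in your question. It first times a single bare fetch, then fetches all pages with a thread pool (multiprocessing.dummy provides the Pool API backed by threads, which suits I/O-bound work like HTTP requests); the pool size of 10 is just an arbitrary starting point.

import time
from multiprocessing.dummy import Pool  # thread-backed Pool

import requests

URLS = ["http://testserver.dc/%d/" % i for i in range(1, 150)]

# Step 1: time one bare request, with no parsing or database work.
start = time.time()
requests.get(URLS[0])
print 'one request took %.2f s' % (time.time() - start)

# Step 2: if a single request is already too slow, fetch in parallel.
def fetch(url):
    return url, requests.get(url).content

pool = Pool(10)  # 10 concurrent downloads; tune for your server
start = time.time()
pages = pool.map(fetch, URLS)
pool.close()
pool.join()
print 'fetched %d pages in %.2f s' % (len(pages), time.time() - start)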

Further optimization tips

Parsing XML - use an iterative / SAX approach

Your DOM creation can be turned into iterative parsing. It needs less memory and is very often much faster. lxml offers methods like iterparse; see also the related SO answer.
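
A minimal sketch of the pattern, assuming one of the pages from your question: lxml.etree.iterparse with html=True handles elements as they are closed instead of building the full tree first. Note that CSS selectors such as nth-child do not translate directly to this event-driven style, so the class check below only illustrates the idea.

from io import BytesIO

import requests
from lxml import etree

html = requests.get("http://testserver.dc/1/").content

# Handle each element as soon as it is closed, then release it,
# so the whole document never sits in memory as a tree.
for event, element in etree.iterparse(BytesIO(html), events=('end',),
                                      tag='span', html=True):
    if element.get('class') == 'bold':
        print element.text
    element.clear()  # free the element once it has been handled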

Optimize writing into database

Writing many records one by one can be turned into writing them in batches.
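
A minimal sketch, assuming scraperwiki.sql.save accepts a list of dicts for its data argument (check the documentation of your ScraperWiki version); the record values here are placeholders standing in for the scraped fields.

import scraperwiki

profiles = []
for i in range(1, 150):
    # scrape name, city and profile_id as before; placeholders here
    profiles.append({'name': 'name %d' % i,
                     'city': 'city %d' % i,
                     'profile_id': 'id-%d' % i})

# one bulk save instead of 149 single-row saves
scraperwiki.sql.save(['profile_id'], profiles)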

Upvotes: 2
