Reputation: 1133
I am studying web scraping for big data, so I wrote the following code to take some information from a local server on our campus. It works fine, but I think the performance is very slow; each record takes 0.91 s to get stored in the database. What the code does is open a web page, take some content, and store it on disk.
My goal is to lower the time elapsed for scraping a record to something near 0.4s (or less, if possible).
#!/usr/bin/env python
import scraperwiki
import requests
import lxml.html

for i in range(1, 150):
    try:
        # Fetch the page and parse it into a DOM tree.
        html = requests.get("http://testserver.dc/" + str(i) + "/").content
        dom = lxml.html.fromstring(html)
        for entry in dom.cssselect('.rTopHeader'):
            name = entry.cssselect('.bold')[0].text_content()
        for entry in dom.cssselect('div#rProfile'):
            city = entry.cssselect('li:nth-child(2) span')[0].text_content()
        for entry in dom.cssselect('div#rProfile'):
            profile_id = entry.cssselect('li:nth-child(3) strong a')[0].get('href')
        # Store one record per page, keyed on profile_id.
        profile = {
            'name': name,
            'city': city,
            'profile_id': profile_id
        }
        unique_keys = ['profile_id']
        scraperwiki.sql.save(unique_keys, profile)
        print(profile_id)
    except Exception:
        print('Error: ' + str(i))
Upvotes: 0
Views: 412
Reputation: 44112
It is very good that you have a clear aim for how far you want to optimize.
It is likely that the limiting factor is fetching the URLs.
Simplify your code and measure how long the fetching alone takes. If that does not meet your timing criteria (e.g. if a single request already takes 0.5 seconds), you have to do the scraping in parallel; a thread-pool sketch follows below. Search Stack Overflow, there are many such questions and answers using threading, green threads, etc.
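For example, here is a minimal sketch of parallel fetching with a thread pool. The URL pattern mirrors the question; the pool size and timeout are assumptions you would tune for the real server:

import requests
from multiprocessing.dummy import Pool  # thread-based pool, suited to I/O-bound work

def fetch(i):
    # Return (page number, raw HTML), or (page number, None) on a network error.
    try:
        return i, requests.get("http://testserver.dc/" + str(i) + "/", timeout=10).content
    except requests.RequestException:
        return i, None

pool = Pool(8)  # 8 concurrent requests
pages = pool.map(fetch, range(1, 150))
pool.close()
pool.join()

Each element of pages can then be parsed and stored exactly as before.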
Your DOM creation can be turned into iterative parsing, which needs less memory and is very often much faster. lxml provides iterparse for this; also see the related SO answer. A minimal sketch follows.
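This sketch assumes a recent lxml where lxml.etree.iterparse accepts html=True, and reuses the class names from the question's selectors; whether it actually beats fromstring here depends on how large each page is:

import io
import requests
import lxml.etree

html = requests.get("http://testserver.dc/1/").content

# iterparse expects a file-like source; html=True selects lxml's HTML parser.
for event, element in lxml.etree.iterparse(io.BytesIO(html), events=('end',), html=True):
    cls = element.get('class') or ''
    if 'rTopHeader' in cls:
        # Roughly equivalent to cssselect('.rTopHeader .bold') in the original code.
        name = element.findtext('.//*[@class="bold"]')
    element.clear()  # release elements that are no longer needed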
Writing many records one by one can be turned into writing them in batches, as in the sketch below.
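A sketch of batched saving; it assumes your scraperwiki version accepts a list of row dicts in sql.save (recent versions of the library do), and it reuses the pages list from the thread-pool sketch above:

import scraperwiki
import lxml.html

profiles = []
for i, html in pages:  # pages as produced by the parallel-fetch sketch
    if html is None:
        continue
    dom = lxml.html.fromstring(html)
    profiles.append({
        'name': dom.cssselect('.rTopHeader .bold')[0].text_content(),
        'city': dom.cssselect('div#rProfile li:nth-child(2) span')[0].text_content(),
        'profile_id': dom.cssselect('div#rProfile li:nth-child(3) strong a')[0].get('href'),
    })

# One bulk save instead of one INSERT per record.
scraperwiki.sql.save(['profile_id'], profiles)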
Upvotes: 2