Reputation: 55
I am trying to scrape data with Python from a website that contains around 4000 pages, with 25 links per page.
My problem is that after around 200 processed pages the performance gets so bad that even other programs on my computer freeze.
I guess I am not handling memory correctly or something similar. I would greatly appreciate it if someone could help me get my script running more smoothly and making it less demanding on my system.
Thanks in advance for any help. :)
EDIT: I found the solution; you can find it in the answer I gave when you scroll down a bit. Thanks to everyone who tried to help me, especially etna and Walter A, who gave good suggestions that got me on the right track. :)
from pprint import pprint
from lxml import etree
import itertools
import requests
def parsePageUrls(page):
    # collect all link targets from the current page
    return page.xpath('//span[@class="tip"]/a/@href')

def isLastPage(page):
    # the last page has no "next" link
    return not page.xpath('//a[@rel="next"]')

urls = []
for i in itertools.count(1):
    content = requests.get('http://www.example.com/index.php?page=' + str(i), allow_redirects=False)
    page = etree.HTML(content.text)

    urls.extend(parsePageUrls(page))

    if isLastPage(page):
        break

pprint(urls)
Upvotes: 0
Views: 747
Reputation: 55
I finally found the solution. The problem was that I thought I was working with a list of strings as the return value of tree.xpath, but instead it was a list of _ElementUnicodeResult objects, which blocked the GC from freeing memory because they hold references to their parent tree.
So the solution is to convert these _ElementUnicodeResult objects into normal strings to get rid of those references.
Here is the source that helped me out understanding the issue: http://lxml.de/api/lxml.etree._ElementTree-class.html#xpath
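To illustrate what was going on, here is a minimal standalone sketch (not taken from my actual script) showing that the values returned by xpath() are smart strings that still point back into the parsed tree:

from lxml import etree

# a tiny document, purely for demonstration
page = etree.HTML('<span class="tip"><a href="/item/1">x</a></span>')

href = page.xpath('//span[@class="tip"]/a/@href')[0]
print(type(href))            # <class 'lxml.etree._ElementUnicodeResult'>
print(href.getparent().tag)  # 'a' - the smart string still references the tree
print(type(str(href)))       # <class 'str'> - a plain copy with no such reference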
As for the provided code, the following fixed it:
Instead of:
urls.extend(parsePageUrls(page))
It had to be:
for url in parsePageUrls(page):
    urls.append(str(url))  # str() drops the reference to the parent tree
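A possible alternative (a sketch I have not used in my own script, based on lxml's documented smart_strings option for compiled XPath expressions) would be to make the lookup return plain strings in the first place:

from lxml import etree

# compiling the expression with smart_strings=False makes it return plain
# Python strings instead of _ElementUnicodeResult objects
parseUrls = etree.XPath('//span[@class="tip"]/a/@href', smart_strings=False)

def parsePageUrls(page):
    return parseUrls(page)  # already a list of plain str, no back-references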
Upvotes: 1