Shark8

Reputation: 65

Possible bottleneck issue in web scraping with Python

First of all, I apologize for the vague title, but the problem is that I'm not sure what is causing the error.

I'm using Python to extract some data from a website. The code I created works perfectly when passing one link at a time, but somehow breaks when trying to collect the data from the 8000 pages I have (it actually breaks way before that). The process I need to follow is this:

  1. Collect all the links from one single page (8000 links)
  2. From each link, extract another link contained in an iframe
  3. Scrape the data from the link in step 2

Point 1 is easy and works fine. Points 2 and 3 work for a while, but then I get some errors, every time at a different point, and it's never the same one. After some tests, I decided to try a different approach and run my code only up to point 2 on all the links from point 1, trying to collect all the links first. This is where I found out that the error probably happens during this stage.

The code works like this: in a for loop I pass each item of a list of URLs to the function below. It's supposed to search for a link to the Disqus website; there should be only one such link, and there always is one. Because it's not possible to look inside an iframe with a library like lxml, I use Selenium and ChromeDriver.

import time
from selenium import webdriver

def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    list_urls = []
    urls = []

    # collects all the urls of all the iframe tags
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        list_urls.append(driver.current_url)
        driver.switch_to_default_content()
    driver.quit()

    for item in list_urls:
        if item.startswith('http://disqus'):
            urls.append(item)

    if len(urls)>1:
        print "too many urls collected in iframes"
    else:
        url=urls[0]

    return url

At the beginning there was no time.sleep, and it worked for roughly 30 links. Then I put in a time.sleep(2) and it got to about 60. Now, with time.sleep(3), it works for around 130 links. Of course, this cannot be the solution. The error I get now is always the same (index out of range in url=urls[0]), but each time with a different link. If I run my code on the single link where it broke, it works, so it can actually find the URL there. And of course, a link where it stopped before sometimes passes with no issue. I suspect this happens because of a timeout, but of course I'm not sure.

So, how can I figure out what the issue is here?

If the problem is that it makes too many requests (despite the sleep), how can I deal with this?

Thank you.

Upvotes: 1

Views: 314

Answers (1)

bruno desthuilliers

Reputation: 77912

From your description of the problem, it might be that the host throttles your client when you issue too many requests in a given time. This is a common protection against DoS attacks and ill-behaved robots - like yours.

The clean solution here is to check whether the site has a robots.txt file and, if so, parse it and respect the rules; otherwise, set a large enough wait time between two requests so you don't get kicked.
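
As an illustration, here is a minimal sketch of that check using the standard library's robotparser module (urllib.robotparser in Python 3). The site URLs, user agent and delay below are placeholders, not values from your question:

import time
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # hypothetical target site
rp.read()

if rp.can_fetch("*", "http://example.com/some-page"):
    # fetch the page here, then pause before the next request
    time.sleep(5)  # arbitrary delay - tune it to what the host tolerates
else:
    print "robots.txt disallows this url"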

Also, you can run into quite a few other issues - 404s, lost network connections, etc. - and even page load timing issues with selenium.webdriver, as documented here:

Dependent on several factors, including the OS/Browser combination, WebDriver may or may not wait for the page to load. In some circumstances, WebDriver may return control before the page has finished, or even started, loading. To ensure robustness, you need to wait for the element(s) to exist in the page using Explicit and Implicit Waits.
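
Concretely, an explicit wait on the iframes could replace your fixed time.sleep calls - a sketch, assuming a 10-second timeout is acceptable for your pages:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until at least one iframe is present,
# or raise TimeoutException after 10 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "iframe"))
)
iframes = driver.find_elements_by_tag_name("iframe")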

wrt/ your IndexError, you blindly assume that you'll get at least one url (which means at least one iframe), which might not be the case for any of the reasons above (and a few others too). First you want to make sure you properly handle all those corner cases, then fix your code so it doesn't assume you have at least one url:

url = None
if len(urls) > 1:
    print "too many urls collected in iframes"
elif len(urls) == 1:
    url = urls[0]
else:
    print "no url found"

Also, if all you want is the first http://disqus url you can find, there's no need to collect them all, then filter them, then return the first:

def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    try:
        driver.get(webpage_url)
        iframes = driver.find_elements_by_tag_name("iframe")
        # check each iframe's url and return the first Disqus one
        for iframe in iframes:
            driver.switch_to_frame(iframe)
            time.sleep(3)
            if driver.current_url.startswith('http://disqus'):
                return driver.current_url
            driver.switch_to_default_content()
        return None # nothing found
    finally:
        # quit the browser even on an early return
        driver.quit()
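
Finally, since your failures are intermittent, a thin retry wrapper around get_url with a growing pause between attempts can help absorb throttling. A minimal sketch - the retry count and base delay are arbitrary, and get_url_with_retries is just a name chosen for this example:

import time

def get_url_with_retries(webpage_url, retries=3, base_delay=5):
    # try a few times, sleeping longer after each failed attempt
    for attempt in range(retries):
        url = get_url(webpage_url)
        if url is not None:
            return url
        time.sleep(base_delay * (attempt + 1))
    return None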

Upvotes: 3
