PoweredByCoffee

Reputation: 1193

Python Selenium Failing When Some Threads Create Webdriver

I have a thread which takes a URL, requests it through Selenium, and parses the data.

Most of the time this thread works fine, but sometimes it seems to hang while creating the webdriver, and I can't seem to catch it with exception handling.

This is the start of the thread:

def GetLink(eachlink):

    trry = 0  # up to 10 attempts at getting the data

    while trry < 10:

        print "Scraping:  ", eachlink
        try:
            Numbergrab = []
            Namegrab = []
            Positiongrab = []

            nextproxy = random.choice(ProxyList)
            nextuseragent = random.choice(UseragentsList)
            proxywrite = '--proxy=' + nextproxy  # concatenate: a trailing comma would make this a tuple
            service_args = [
                proxywrite,
                '--proxy-type=http',
                '--ignore-ssl-errors=true',
            ]

            dcap = dict(DesiredCapabilities.PHANTOMJS)
            dcap["phantomjs.page.settings.userAgent"] = nextuseragent
            pDriver = webdriver.PhantomJS(r'C:\phantomjs.exe',  # raw string so the backslash isn't an escape
                                          desired_capabilities=dcap,
                                          service_args=service_args)
            pDriver.set_window_size(1024, 768)  # optional
            pDriver.set_page_load_timeout(20)

            print "Requesting link: ", eachlink
            pDriver.get(eachlink)
            try:
                WebDriverWait(pDriver, 10).until(
                    EC.presence_of_element_located((By.XPATH, "//div[@class='seat-setting']")))
            except:
                time.sleep(10)

That's just a snippet, but it's the important part, because when it's working the rest continues fine.

But when something stalls, one of the threads will print "Scraping: link" to the console but never "Requesting link: link".

Which means the thread is stalling while actually setting up the webdriver. As far as I've ever seen this is thread-safe, and I've tried using lock.acquire and giving each thread a random .exe out of a batch of 20, with the same results.

Sometimes the threads will work perfectly, then out of nowhere one stops without ever making the request.

Update:

Sometimes when I close the console it tells me there was a socket.error. You can see the start of the try in the snippet above; at the end I have this:

        except:
            trry += 1
            e = sys.exc_info()[0]
            print "Problem scraping link: ", e

But it'll happily sit there for hours saying nothing until I physically close the console. Then it pops up with socket.error and the "Scraping: link" message for the thread that died.

That would suggest it's failing before even entering the while loop, but trry is set to 0 at the start of the thread and isn't referenced anywhere else. Plus there'd be no socket.error if it never got as far as creating a Selenium webdriver, so it must be blocking the earlier message as well.

Update #2:

It looks like it's happy to run for hours when running a single thread of the exact same code.

But a thread lock didn't make a difference.

Little stumped. Going to try a subprocess instead of a thread to see what that does.

Update #3:

Threading isn't stable over long runs, but subprocessing is. OK, Python.
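For anyone hitting the same wall, this is roughly the shape of the subprocess version I ended up with. A minimal sketch only: scrape_link is a stand-in for the real per-link work from the snippet above, and the names and timeout are illustrative, not my actual code.

```python
import multiprocessing


def scrape_link(link):
    # Placeholder for the real per-link work: create the PhantomJS
    # driver, pDriver.get(link), parse, pDriver.quit().
    pass


def scrape_with_timeout(link, timeout=60):
    """Run one scrape in a child process and kill it if it hangs."""
    proc = multiprocessing.Process(target=scrape_link, args=(link,))
    proc.start()
    proc.join(timeout)        # wait up to `timeout` seconds
    if proc.is_alive():       # still running: the webdriver call hung
        proc.terminate()      # unlike a thread, a process can be killed
        proc.join()
        return False          # caller is free to retry in a fresh process
    return proc.exitcode == 0
```

The point is that a hung child process can be terminated from the outside, which a hung Python thread can't.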

Upvotes: 2

Views: 2326

Answers (1)

Levi Noecker

Reputation: 3300

I've encountered this with both multithreading and multiprocessing, and when using Firefox, Chrome, or PhantomJS. For whatever reason, the call that instantiates the browser (e.g. driver = webdriver.Chrome()) never returns.

Most of my scripts are relatively short-lived with few threads/processes, so the problem isn't often seen. I have a few scripts, however, that run for several hours and create and destroy several hundred browser objects, and I'm guaranteed to experience the hang a few times per run.

My solution is to put the browser instantiation into its own function/method, and then decorate the function/method with one of the many timeout and retry decorators available from PyPI:

(this is untested)

from retrying import retry
from selenium import webdriver
from timeoutcontext import timeout, TimeoutException


def retry_if_timeoutexception(exception):
    return isinstance(exception, TimeoutException)


@retry(retry_on_exception=retry_if_timeoutexception, stop_max_attempt_number=3)
@timeout(30)  # Allow the function 30 seconds to create and return the object
def get_browser():
    return webdriver.Chrome()

https://pypi.python.org/pypi/retrying

https://pypi.python.org/pypi/timeoutcontext
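If you'd rather not add dependencies, the same timeout-and-retry pattern can be hand-rolled with the standard library's concurrent.futures. Also untested against a real hang, and with one caveat noted in the docstring: a timed-out call can't actually be killed this way, only abandoned.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def create_browser_with_retry(factory, attempts=3, timeout=30):
    """Call factory() (e.g. webdriver.Chrome) and retry if it hangs.

    Caveat: a timed-out factory call keeps running in its background
    thread; this only unblocks the caller. The decorator/process-based
    approaches above can kill the hung work outright.
    """
    last_error = None
    for _ in range(attempts):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(factory)
        try:
            return future.result(timeout=timeout)
        except TimeoutError as err:
            last_error = err
        finally:
            pool.shutdown(wait=False)  # don't block on a hung worker thread
    raise last_error
```

Usage would be `browser = create_browser_with_retry(webdriver.Chrome)`.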

Upvotes: 2
