chris

Reputation: 2020

Using one PhantomJS browser per worker, rather than creating a new instance for each URL in the queue, slows performance dramatically

I'm using Selenium's PhantomJS browser to scrape a website. Because I have a large number of queries to run, I'm also using threads that pull work from a queue. Previously I created a new browser instance for every URL, like this:

def worker():
    while True:   
        driver = webdriver.PhantomJS()
        driver.set_page_load_timeout(10)
        driver.set_window_size(1400,1000)
        params = q.get()
        params = params + (driver,)
        print params
        crawl(*params)
        driver.quit()
        q.task_done()

I read online that opening a browser is costly, so I should open one browser per worker and reuse it for every URL. I tried this, but it actually made the crawl dramatically slower. When I checked my computer in the morning, the program was using almost all of my RAM, and I had to restart the computer to stop it. Here's my code. If anyone knows how I can use one browser per worker while also increasing speed, please let me know. Thanks!

def worker():
    while True:
        try:
            driver
        except NameError:
            driver = PhantomJS()
            driver.set_page_load_timeout(10)
            driver.set_window_size(1400,1000)
        params = q.get()
        params = params + (driver,)
        print params
        crawl(*params)
        q.task_done()

q = Queue()

for i in range(10):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

# all dates from init_date to 12/31/2015
date_period = 10
#init_date = "01/01/90"
init_date = "01/01/01"
for i in range(1826/date_period): 
    start_date, end_date = date_inc(init_date, i*date_period), 
    params = (start_date, end_date, 0)
    q.put(params)

q.join()  

Upvotes: 0

Views: 81

Answers (1)

Bert Jan Schrijver

Reputation: 1531

It's true that starting up a browser takes some time. It's also true that keeping a browser running for a while will increase memory usage for most browsers.

I once had a similar problem when running Internet Explorer and taking lots of screenshots for automated test reports. I ended up restarting the browser after every so many tests.

That might help in your case too: use one (or a few) browsers, and restart each browser every n seconds or after every n commands.
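As a rough illustration of the "restart after every n commands" idea, here is a minimal sketch of a worker that reuses one driver and replaces it periodically. The names make_driver, RESTART_EVERY, and the None shutdown sentinel are my own assumptions, not part of your code, and the threshold of 50 is a guess you would need to tune against the memory growth you actually see.

```python
import queue
import threading

RESTART_EVERY = 50  # assumed value; tune it against observed memory growth


def worker(q, make_driver, crawl, restart_every=RESTART_EVERY):
    """Reuse one driver per worker, replacing it after `restart_every` tasks.

    `make_driver` is a hypothetical zero-argument factory, for example a
    function that calls webdriver.PhantomJS() and sets the page load
    timeout and window size as in the question.
    """
    driver = make_driver()
    handled = 0
    while True:
        params = q.get()
        try:
            if params is None:  # assumed sentinel: stop this worker
                break
            if handled >= restart_every:
                # Recycle the browser so its memory is released
                driver.quit()
                driver = make_driver()
                handled = 0
            handled += 1
            crawl(*(params + (driver,)))
        except Exception as exc:
            # One failed page should not kill the whole worker
            print("crawl failed for %r: %s" % (params, exc))
        finally:
            q.task_done()
    driver.quit()  # always close the last browser on shutdown
```

Each thread would be started with something like Thread(target=worker, args=(q, make_driver, crawl)), and putting one None on the queue per worker lets every browser be closed cleanly at the end instead of being left running.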

Upvotes: 1
