Reputation: 2020
I'm using selenium's PhantomJS browser to scrape a website. I'm also using threading via queues with workers, because I have many many queries to run. Previously I would create a new instance of the web browser to visit every URL, like the following:
def worker():
while True:
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(10)
driver.set_window_size(1400,1000)
params = q.get()
params = params + (driver,)
print params
crawl(*params)
driver.quit()
q.task_done()
I saw online that people suggested opening the browser was costly, and so I should just open a browser for each worker and use it every time. I tried this, but it actually decreased the speed dramatically, and when I checked my computer in the morning, it was using almost the entire capacity of my ram and I had to restart the computer to stop the program from running. Here's my code - if any of you know how I can use one browser per worker whilst increasing speed, please let me know! thanks!
def worker():
while True:
try:
driver
except NameError:
driver = PhantomJS()
driver.set_page_load_timeout(10)
driver.set_window_size(1400,1000)
params = q.get()
params = params + (driver,)
print params
crawl(*params)
q.task_done()
q = Queue()
for i in range(10):
t = Thread(target=worker)
t.daemon = True
t.start()
# all dates from init_date to 12/31/2015
date_period = 10
#init_date = "01/01/90"
init_date = "01/01/01"
for i in range(1826/date_period):
start_date, end_date = date_inc(init_date, i*date_period),
params = (start_date, end_date, 0)
q.put(params)
q.join()
Upvotes: 0
Views: 81
Reputation: 1531
It's true that starting up a browser takes some time. It's also true that keeping a browser running for a while will increase memory usage for most browsers.
I had a similar problem once running Internet Explorer and making lots of screenshots for reporting of automated tests. I ended up restarting the browser after every so much tests.
That might help too in your case: use one (or a few) browsers and restart the browser(s) every n seconds or after every n commands.
Upvotes: 1