Reputation: 441
In my Django app I use Selenium to crawl and parse some HTML pages. I tried to introduce multiprocessing to improve performance. This is my code:
import os
from selenium import webdriver
from multiprocessing import Pool

os.environ["DISPLAY"] = ":56017"

def render_js(url):
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(300)
    driver.get(url)
    text = driver.page_source
    driver.quit()
    return text

def parsing(url):
    text = render_js(url)
    ... parsing the text ....
    ... write in db....

url_list = ['www.google.com', 'www.python.com', 'www.microsoft.com']
pool = Pool(processes=2)
pool.map_async(parsing, url_list)
pool.close()
pool.join()
I get this error when two processes run simultaneously and both use Selenium: the first process starts Firefox with 'www.google.it' and returns the correct text, while the second, with URL 'www.python.com', returns the text of www.google.it instead of www.python.com. Can you tell me where I'm going wrong?
Upvotes: 2
Views: 4992
Reputation: 71
from selenium import webdriver
from multiprocessing import Pool

def parsing(url):
    driver = webdriver.Chrome()
    driver.set_page_load_timeout(300)
    driver.get(url)
    text = driver.page_source
    driver.quit()  # quit() shuts down the browser and the driver process
    return text

url_list = ['http://www.google.com', 'http://www.python.com']
pool = Pool(processes=4)
ret = pool.map(parsing, url_list)
for text in ret:
    print(text[:30])
I tried running your code and Selenium complained about bad URLs; adding http:// to them made it work.
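If you don't want to hand-edit the list, you can add the scheme automatically before handing the URLs to the pool. A minimal sketch using only the standard library (the `normalize_url` helper is hypothetical, not part of the original answer):

```python
from urllib.parse import urlparse

def normalize_url(url):
    """Prepend http:// when the URL has no scheme, so the driver
    doesn't reject it as a relative address."""
    if not urlparse(url).scheme:
        return 'http://' + url
    return url

url_list = [normalize_url(u) for u in ['www.google.com', 'www.python.com']]
print(url_list)  # ['http://www.google.com', 'http://www.python.com']
```

Already-qualified URLs (e.g. `https://...`) pass through unchanged.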
Upvotes: 3