SidGabriel

Reputation: 185

Run Selenium scrapers in parallel from script / Python

I just learned Selenium to scrape some data that couldn't be reached with Scrapy. I have made different scripts for different bots, and they run as expected individually.

When I run them together, the two browsers open one after the other, while I want them running at the same time. Is there a simple way to do that? Thank you for your help!

EDIT:

I have been trying this with multiprocessing, as suggested, but unfortunately it runs only one script at a time, even though two Selenium windows are opened; one stays inactive. Here is my code if you want to have a look:

u = UBot()
google = GoogleBot()

list_test = [[u.main(), google.main()]]

processes = []
for test in list_test:
    p = multiprocessing.Process()
    processes.append(p)
    p.start()

EDIT 2:

I was able to solve it and posted the code below!

Upvotes: 1

Views: 1747

Answers (2)

SidGabriel

Reputation: 185

I was able to solve my problem based on your recommendations about multiprocessing, so thank you all :) I'm posting my code in case another beginner needs something like this, although there are probably other (and better) ways to do it.

from multiprocessing import Pool

from google_bot import GoogleBot
from u_bot import UBot


def google_process():
    # Each bot is created inside its own worker process,
    # so each gets its own browser.
    google = GoogleBot()
    google.main()


def u_process():
    u = UBot()
    u.main()


def main():
    # Two worker processes, one per bot, running at the same time.
    pool = Pool(processes=2)
    pool.apply_async(google_process)
    pool.apply_async(u_process)

    # No more jobs to submit; wait for both bots to finish.
    pool.close()
    pool.join()


# The __main__ guard matters: multiprocessing may re-import this module
# in the child processes, and the guard keeps them from creating pools
# of their own.
if __name__ == '__main__':
    main()
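
For what it's worth, the original attempt failed because u.main() and google.main() were called while building list_test (so the bots ran one after the other in the parent process), and Process() was created without a target. Here is a minimal sketch of the same idea using multiprocessing.Process directly, assuming UBot and GoogleBot can each be constructed inside a child process:

import multiprocessing

from google_bot import GoogleBot
from u_bot import UBot


def run_u():
    # Build the bot inside the child process so its browser lives there.
    UBot().main()


def run_google():
    GoogleBot().main()


if __name__ == '__main__':
    # Pass the functions themselves as targets; do not call them here.
    processes = [
        multiprocessing.Process(target=run_u),
        multiprocessing.Process(target=run_google),
    ]
    for p in processes:
        p.start()  # both browsers launch without waiting on each other
    for p in processes:
        p.join()   # wait for both bots to finish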

Upvotes: 0

nicholishen

Reputation: 3012

I made a simple lib called selsunpool that wraps concurrent.futures; you might want to try it. It creates a local pool of Selenium workers that stay alive and can be reused any number of times for concurrent jobs. It's not well documented at the moment, but it's simple to use. Here is an example.

Step 1: Define a function with the selenium job decorator. The decorator parameter names the kwarg to which the webdriver is attached when the pool executor calls the function.

from selsunpool import selenium_job, SeleniumPoolExecutor


@selenium_job(webdriver_param_name='mydriver')
def get_url(url, mydriver):
    mydriver.get(url)
    return mydriver.title

Step 2: Use the pool executor the same way you'd use ThreadPoolExecutor. Note: job results are retrieved via a property that yields results as they finish.

with SeleniumPoolExecutor(num_workers=2, close_on_exit=True) as pool:
    sites = ['https://google.com', 'https://msn.com']
    pool.map(get_url, sites)
    print(list(pool.job_results))
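
If you would rather avoid an extra dependency, the pattern selsunpool wraps can be sketched with plain concurrent.futures, keeping one webdriver per worker thread via threading.local. This is only an illustration of the idea, not selsunpool's actual implementation, and it assumes Chrome with the standard selenium package:

import threading
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver

# One webdriver per worker thread, created lazily and reused across jobs.
thread_local = threading.local()


def get_driver():
    if not hasattr(thread_local, 'driver'):
        thread_local.driver = webdriver.Chrome()
    return thread_local.driver


def get_url(url):
    driver = get_driver()
    driver.get(url)
    return driver.title


if __name__ == '__main__':
    sites = ['https://google.com', 'https://msn.com']
    with ThreadPoolExecutor(max_workers=2) as pool:
        # map returns the titles in order; the drivers are left open here,
        # so a real script should quit them when the work is done.
        print(list(pool.map(get_url, sites)))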

Upvotes: 3
