huzefausama

Reputation: 443

Scraping multiple webpages at once with Selenium

I am using Selenium and Python for a big project. I have to go through 320,000 (320K) webpages one by one, scrape details, sleep for a second, and move on.

Like below:

import time
from selenium import webdriver

browser = webdriver.Firefox()

links = ["https://www.thissite.com/page=1", "https://www.thissite.com/page=2", "https://www.thissite.com/page=3"]

for link in links:
    browser.get(link)
    # singular find_element_by_xpath returns one element with .text; the plural returns a list
    scrapedinfo = browser.find_element_by_xpath("//div/productprice").text
    with open("file.csv", "a+") as f:
        f.write(scrapedinfo + "\n")
    time.sleep(1)

The greatest problem: it's too slow!

With this script it will take days or maybe weeks.

I have spent hours searching for answers on Google and Stack Overflow and only found material about multiprocessing.

But I am unable to apply it in my script.

Upvotes: 3

Views: 4338

Answers (3)

imbr

Reputation: 7672

Threading approach

  • You should start with threading.Thread, which will give you a considerable performance boost (explained here). Threads are also lighter than processes. You can use a futures.ThreadPoolExecutor, with each thread using its own webdriver. Consider also adding the headless option for your webdriver. Example below using a Chrome webdriver:
from concurrent import futures

from selenium import webdriver

def selenium_work(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    driver.get(url)
    # <actual work that needs to be done by selenium>
    driver.quit()

# the default number of threads is based on the number of CPU cores,
# but you can set it with `max_workers`, e.g. futures.ThreadPoolExecutor(max_workers=...)
with futures.ThreadPoolExecutor() as executor:
    # store the url for each thread in a dict, so we know which thread fails
    future_results = {url: executor.submit(selenium_work, url) for url in links}
    for url, future in future_results.items():
        try:
            future.result()  # can pass `timeout` to wait a max number of seconds per thread
        except Exception as exc:  # a thread may raise an exception
            print('url {0} generated an exception: {1}'.format(url, exc))

  • Consider also storing the chrome-driver instance initialized on each thread using threading.local(), so each thread reuses one browser instead of launching a new one per URL. From here they reported a reasonable performance improvement. See the sketch after this list.

  • Consider whether using BeautifulSoup directly on the page source from selenium can give some further speed-up. It's a very fast and established package. Example: something like driver.get(url) ... soup = BeautifulSoup(driver.page_source, "lxml") ... result = soup.find('a'). This is also shown in the sketch below.
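
Here is a minimal sketch combining the two previous points: the driver is cached per thread in a threading.local() object, and BeautifulSoup does the parsing. It assumes `links` is the list of URLs from the question; the selector and helper names are illustrative, not from the original answer.

import threading
from concurrent import futures

from bs4 import BeautifulSoup
from selenium import webdriver

thread_local = threading.local()

def get_driver():
    # reuse one headless Chrome per thread instead of launching one per URL
    if not hasattr(thread_local, "driver"):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        thread_local.driver = webdriver.Chrome(options=options)
    return thread_local.driver

def scrape(url):
    driver = get_driver()
    driver.get(url)
    # hand the rendered HTML to BeautifulSoup, which parses it faster than
    # repeated selenium element lookups
    soup = BeautifulSoup(driver.page_source, "lxml")
    return soup.find("div", class_="productprice")  # illustrative selector

with futures.ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(scrape, links))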

Other approaches

  • Although I personally have not seen much benefit from using concurrent.futures.ProcessPoolExecutor(), you could experiment with it. In fact, it was slower than threads in my experiments on Windows, and on Windows there are many limitations for a Python Process. A minimal sketch follows this list.

  • Consider whether your use case can be satisfied by arsenic, an asynchronous webdriver client built on asyncio. It sounds really promising, though it has many limitations.

  • Consider whether Requests-Html solves your problems with JavaScript loading, since it claims full JavaScript support. In that case you could use it with BeautifulSoup in a standard data-scraping methodology.
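
For the first point, the process-based variant is a small change to the earlier threaded example: swap in ProcessPoolExecutor. A minimal sketch, assuming selenium_work and links are defined at module level (this matters, because a Process must be able to pickle the function and arguments it runs); the worker count is illustrative:

from concurrent import futures

if __name__ == "__main__":  # required on Windows, where child processes re-import the module
    with futures.ProcessPoolExecutor(max_workers=4) as executor:
        future_results = {url: executor.submit(selenium_work, url) for url in links}
        for url, future in future_results.items():
            try:
                future.result()
            except Exception as exc:
                print('url {0} generated an exception: {1}'.format(url, exc))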

Upvotes: 5

Yousuf

Reputation: 13

If the website you are scraping is not too heavily secured against bots, it is better to use Requests; combined with multi-threading and multi-processing, it will reduce your time from days to a couple of hours. The full steps are too long to go over, but here is the general idea:

import concurrent.futures
from multiprocessing import Process

def threader_run(data):
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for i in data:
            # `scrapper` is your final scraping function for a single link
            futures.append(executor.submit(scrapper, i))
        for future in concurrent.futures.as_completed(futures):
            print(future.result())


# split the compiled list of links into one chunk per process
data = {}
data['process1'] = []
data['process2'] = []
data['process3'] = []

if __name__ == "__main__":
    jobs = []
    for x in data:
        p = Process(target=threader_run, args=(data[x],))
        jobs.append(p)
        p.start()
        print(f'Started - {x}')
    for p in jobs:
        p.join()

Basically, this first compiles all the links, then splits them into 3 arrays so that 3 processes run simultaneously (you could run more processes depending on your CPU cores and how data-intensive these jobs are). After that, you could split those arrays further, into 10 or even 100 chunks depending on your project size. Each process runs a thread pool with a maximum of 8 workers, which in turn calls your final function. A sketch of the splitting step follows.
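
A minimal sketch of that splitting step; the chunk helper and the three-way split are illustrative, not part of the original answer:

def chunk(seq, n):
    # split seq into n roughly equal consecutive slices
    size = -(-len(seq) // n)  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

links = ["https://www.thissite.com/page={}".format(i) for i in range(1, 320001)]
data = {'process{}'.format(i + 1): part for i, part in enumerate(chunk(links, 3))}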

Here, with 3 processes and 8 workers each, you are looking at up to a 24x speed boost. However, using the Requests library is necessary: if you used Selenium for this, a normal computer/laptop would freeze, because it would mean 24 browsers running simultaneously.
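
The scrapper function is left undefined in the answer; a minimal Requests-based version might look like this sketch, where the CSS class and timeout are assumptions for illustration:

import requests
from bs4 import BeautifulSoup

def scrapper(url):
    # a plain HTTP fetch is far lighter than a browser, but it does not run
    # JavaScript, so it only works if the data is present in the raw HTML
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    price = soup.find("div", class_="productprice")  # illustrative selector
    return url, price.get_text(strip=True) if price else None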

Upvotes: -1

Gaj Julije

Reputation: 2183

You can use parallel execution. Divide the list of sites into, for example, ten test cases that use the same code; only the method names will differ (method1, method2, method3, ...). You will increase the speed. The number of browsers depends on your hardware performance. See more at https://www.guru99.com/sessions-parallel-run-and-dependency-in-selenium.html

The main thing is to use TestNG and edit the .xml file to set how many threads you want to use. Like this:

<suite name="TestSuite" thread-count="10" parallel="methods">

Upvotes: 0
