Seamus Lam

Reputation: 155

How to speed up python selenium find_elements?

I am trying to scrape company info from kompass.com

However, as each company profile provides a different amount of detail, certain pages may be missing elements. For example, not all companies have info on 'Associations'. In such cases, my script takes extremely long searching for these missing elements. Is there any way I can speed up the search process?

Here's the excerpt of my script:

import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import ElementNotVisibleException
from lxml import html

def init_driver():
    driver = webdriver.Firefox()
    driver.wait = WebDriverWait(driver, 5)
    return driver

def convert2text(webElement):
    # Return the text of the first match, or 'NA' if the element is absent.
    if webElement:
        return webElement[0].text.encode('utf8')
    return 'NA'

link='http://sg.kompass.com/c/mizkan-asia-pacific-pte-ltd/sg050477/'
driver = init_driver()
driver.get(link)
driver.implicitly_wait(10)

name = driver.find_elements_by_xpath("//*[@id='productDetailUpdateable']/div[1]/div[2]/div/h1")
name = convert2text(name)

## Problem:
associations = driver.find_elements_by_xpath("//body//div[@class='item minHeight']/div[@id='associations']/div/ul/li/strong")
associations = convert2text(associations)

It takes more than a minute to scrape each page and I have more than 26,000 pages to scrape.
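Since each profile page is static once loaded, one way to sidestep per-element Selenium lookups entirely is to grab `driver.page_source` once and query it with lxml (already imported above). A minimal sketch of that idea, using a hypothetical HTML fragment in place of the real page source:

```python
from lxml import html

def first_text_or_na(tree, xpath):
    # Return the text of the first match, or 'NA' when the element is absent.
    matches = tree.xpath(xpath)
    return matches[0].text_content().strip() if matches else 'NA'

# Hypothetical fragment standing in for driver.page_source
page_source = """
<div id="productDetailUpdateable"><h1>Mizkan Asia Pacific Pte Ltd</h1></div>
<div class="item minHeight"></div>
"""
tree = html.fromstring(page_source)

name = first_text_or_na(tree, "//*[@id='productDetailUpdateable']//h1")
associations = first_text_or_na(tree, "//div[@id='associations']//strong")
print(name)          # Mizkan Asia Pacific Pte Ltd
print(associations)  # NA
```

Absent elements then cost a dictionary-free in-memory lookup instead of a wait in the browser, which matters when most of the 26,000 pages are missing some fields.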

Upvotes: 3

Views: 8226

Answers (2)

Guy

Reputation: 50939

driver.implicitly_wait(10) tells the driver to wait up to 10 seconds for an element to exist in the DOM. That means that each time you look for a non-existent element, the driver waits the full 10 seconds before giving up. Reducing the timeout to 2-3 seconds will improve the run time.
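To see why this matters at the scale in the question, a quick back-of-the-envelope calculation (the number of absent optional fields per page is an assumption here):

```python
pages = 26000           # pages to scrape, from the question
missing_per_page = 2    # assumed count of absent optional fields per page

def wasted_hours(wait_seconds):
    # Total time spent waiting on elements that will never appear.
    return pages * missing_per_page * wait_seconds / 3600.0

print(wasted_hours(10))  # ~144 hours lost to a 10 second implicit wait
print(wasted_hours(2))   # ~29 hours with a 2 second wait
```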

In addition, XPath is the slowest selector, and you are making it worse by giving an absolute path. Use find_elements_by_id and find_elements_by_class_name where you can. For example, you can improve

driver.find_elements_by_xpath("//body//div[@class='item minHeight']/div[@id='associations']/div/ul/li/strong")

Simply by starting with the associations id

driver.find_elements_by_xpath("//*[@id='associations']/div/ul/li/strong")

Or changing it to css_selector

driver.find_elements_by_css_selector("#associations > div > ul > li > strong")

Upvotes: 6

Ben C

Reputation: 486

Since your XPaths use no attributes other than class and id to find elements, you could migrate your searches to CSS selectors. These may be faster on browsers like IE, where native XPath searching is not supported.

For example:

//body//div[@class='item minHeight']/div[@id='associations']/div/ul/li/strong

Can become:

body .item.minHeight > #associations > div > ul > li > strong

Upvotes: 1
