Reputation: 155
I am trying to scrape company info from kompass.com
However, as each company profile provides a different amount of detail, certain pages may be missing elements. For example, not all companies have info on 'Associations'. In such cases, my script takes extremely long searching for these missing elements. Is there any way I can speed up the search process?
Here's the excerpt of my script:
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import ElementNotVisibleException
from lxml import html
def init_driver():
    driver = webdriver.Firefox()
    driver.wait = WebDriverWait(driver, 5)
    return driver

def convert2text(webElement):
    if webElement != []:
        webElement = webElement[0].text.encode('utf8')
    else:
        webElement = 'NA'
    return webElement
link='http://sg.kompass.com/c/mizkan-asia-pacific-pte-ltd/sg050477/'
driver = init_driver()
driver.get(link)
driver.implicitly_wait(10)
name = driver.find_elements_by_xpath("//*[@id='productDetailUpdateable']/div[1]/div[2]/div/h1")
name = convert2text(name)
## Problem:
associations = driver.find_elements_by_xpath("//body//div[@class='item minHeight']/div[@id='associations']/div/ul/li/strong")
associations = convert2text(associations)
It takes more than a minute to scrape each page and I have more than 26,000 pages to scrape.
Upvotes: 3
Views: 8226
Reputation: 50939
driver.implicitly_wait(10)
tells the driver to wait up to 10 seconds for an element to exist in the DOM. That means every lookup for a non-existent element blocks for the full 10 seconds. Reducing the timeout to 2-3 seconds will improve the run time.
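To see why the implicit wait dominates, here is a toy polling loop (plain Python for illustration, not Selenium's actual internals) that mimics how an implicit wait behaves when the element never appears:

```python
import time

# Toy model of an implicit wait: keep polling until the element
# appears or the timeout expires. For an element that is absent,
# the lookup always costs the *full* timeout.
def find_elements(page, selector, timeout, poll_interval=0.05):
    deadline = time.monotonic() + timeout
    while True:
        matches = [el for el in page if el == selector]
        if matches:
            return matches
        if time.monotonic() >= deadline:
            return []  # never appeared: the whole timeout was spent
        time.sleep(poll_interval)

page = ["name", "address"]          # hypothetical page contents
start = time.monotonic()
missing = find_elements(page, "associations", timeout=0.3)
elapsed = time.monotonic() - start  # roughly the full 0.3 s timeout
```

With a 10-second implicit wait, every absent field ('Associations' and the like) costs the full 10 seconds, which adds up quickly over 26,000 pages.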
In addition, xpath is the slowest selector type, and you are making it worse by giving an absolute path. Use find_elements_by_id and find_elements_by_class_name where you can. For example, you can improve
driver.find_elements_by_xpath("//body//div[@class='item minHeight']/div[@id='associations']/div/ul/li/strong")
simply by starting from the associations id:
driver.find_elements_by_xpath("//*[@id='associations']/div/ul/li/strong")
Or changing it to css_selector
driver.find_elements_by_css_selector("#associations > div > ul > li > strong")
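As a sanity check that the shorter id-anchored path selects the same node as the long absolute one, here is a small lxml sketch over a made-up fragment that mirrors the page structure (the class and id names come from the question; the content is hypothetical):

```python
from lxml import html

# Hypothetical fragment mirroring the structure the XPaths assume.
doc = html.fromstring("""
<html><body>
  <div class="item minHeight">
    <div id="associations"><div><ul><li><strong>AVA</strong></li></ul></div></div>
  </div>
</body></html>
""")

long_path = ("//body//div[@class='item minHeight']"
             "/div[@id='associations']/div/ul/li/strong")
short_path = "//div[@id='associations']/div/ul/li/strong"

# Both paths land on the same element, so anchoring on the id
# loses nothing while giving the XPath engine far less to walk.
same = doc.xpath(long_path)[0] is doc.xpath(short_path)[0]
```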
Upvotes: 6
Reputation: 486
Since your XPaths are not using any attributes apart from class and id to find elements, you could migrate your searches to CSS Selectors. These may be faster on browsers like IE where native XPath searching is not supported.
For example:
//body//div[@class='item minHeight']/div[@id='associations']/div/ul/li/strong
Can become:
body .item.minHeight > #associations > div > ul > li > strong
(Note that item and minHeight are two classes on the same element, so the class selectors chain without a space.)
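Another option along the same lines, since the question already imports lxml: fetch driver.page_source once per page and run every lookup against the local tree, so a missing element costs nothing instead of a full implicit-wait timeout. A minimal sketch, using a hypothetical fragment in place of the real page source:

```python
from lxml import html

# Stand-in for driver.page_source; the real script would do:
#   tree = html.fromstring(driver.page_source)
page_source = """
<div class="item minHeight">
  <div id="associations"><div><ul><li><strong>AVA</strong></li></ul></div></div>
</div>
"""
tree = html.fromstring(page_source)

def first_text(tree, xpath, default='NA'):
    # Local lookups return immediately whether the element
    # exists or not -- no browser round-trip, no waiting.
    matches = tree.xpath(xpath)
    return matches[0].text_content().strip() if matches else default

associations = first_text(tree, "//div[@id='associations']//strong")
phone = first_text(tree, "//div[@id='phone']//strong")  # absent field
```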
Upvotes: 1