Dexxrey

Reputation: 185

Web Scraping with Python Selenium performance

In terms of performance, it is obvious that web scraping with BeautifulSoup is much faster than driving a browser with Selenium. However, I don't know any other way to get content from a dynamic web page. I thought the difference came from the time the browser needs to load elements, but it is definitely more than that. Once the browser had loaded the page (5 seconds), all I had to do was extract some <tr> tags from a table. It took about 3-4 minutes to extract 1016 records, which is extremely slow in my opinion. I have come to the conclusion that the WebDriver methods for finding elements, such as find_elements_by_name, are slow. Is find_elements_by_* from the WebDriver much slower than the find method in BeautifulSoup? And would it be faster to get the whole HTML from the WebDriver browser and then parse it with lxml or BeautifulSoup?

Upvotes: 0

Views: 2141

Answers (4)

undetected Selenium

Reputation: 193088

Web scraping with Python using either BeautifulSoup or Selenium should be a part of your strategy. To put it plainly: if your intent is to scrape static content, BeautifulSoup is unmatched. But in case the website content is dynamically rendered, Selenium is the way to go.

Having said that, BeautifulSoup won't wait for dynamic content which isn't readily present in the DOM tree once page loading completes. Whereas with Selenium you have Implicit Waits and Explicit Waits at your disposal to locate the desired dynamic elements.
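For example, a minimal explicit-wait sketch (the URL and the row locator are placeholders, not taken from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Block for up to 10 seconds until the dynamically rendered rows appear
rows = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr"))
)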

Finally, find_elements_by_name() may be marginally more expensive in terms of performance, as Selenium translates it into its equivalent find_elements_by_css_selector(). You can find some more details in this discussion.


Outro

Official locator strategies for the webdriver

Upvotes: 1

chitown88

Reputation: 28565

Look into 2 options:

1) Sometimes these dynamic pages actually have the data within <script> tags in valid JSON format. You can use requests to get the HTML, BeautifulSoup to get the <script> tag, and then json.loads() to parse it (see the sketch after this list).

2) Go directly to the source. Look at the dev tools and search the XHR requests to see if you can hit the URL/API that generates the data and retrieve it that way (most likely again in JSON format). In my opinion, this is by far the better/faster option when available.
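As a rough sketch of both options (the URLs and the <script> tag's attributes are hypothetical placeholders, not from the question):

import json
import requests
from bs4 import BeautifulSoup

# Option 1: pull the JSON embedded in a <script> tag out of the static HTML
html = requests.get("https://example.com/page").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", {"type": "application/json"})  # hypothetical attributes
data = json.loads(script.string)

# Option 2: call the XHR endpoint found in dev tools directly
records = requests.get("https://example.com/api/records").json()  # hypothetical endpoint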

If you can provide the url, I can check to see if either of these options apply to your situation.

Upvotes: 1

pguardiario

Reputation: 54984

You could also try evaluating in JavaScript. For example, this:

# One execute_script round-trip grabs both texts inside the browser
item = driver.execute_script("""return {
  div: document.querySelector('div').innerText,
  h2: document.querySelector('h2').innerText
}""")

will be at least 10x faster than this:

# Each find_element call and .text access is a separate round-trip to the browser
item = {
  "div": driver.find_element_by_css_selector('div').text,
  "h2": driver.find_element_by_css_selector('h2').text
}

I wouldn't be surprised if it were faster than BeautifulSoup a lot of the time, too.

Upvotes: 0

ICTylor

Reputation: 470

Yes, it would be much faster to use Selenium only to get the HTML after waiting for the page to be ready, and then use BeautifulSoup or lxml to parse that HTML.
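A minimal sketch of that approach (it assumes a driver that has already loaded the page; the <tr> locator mirrors the question):

from bs4 import BeautifulSoup

# One call fetches the fully rendered HTML; parsing then happens in memory
soup = BeautifulSoup(driver.page_source, "lxml")  # the lxml parser needs lxml installed
rows = soup.find_all("tr")  # no per-element round-trips to the browser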

Another option could be to use Puppeteer, either only to get the HTML or to get the info you want directly. It should also be faster than Selenium. There are some unofficial Python bindings for it: pyppeteer.
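A minimal pyppeteer sketch (the URL is a placeholder):

import asyncio
from pyppeteer import launch

async def get_html(url):
    browser = await launch()       # starts a bundled headless Chromium
    page = await browser.newPage()
    await page.goto(url)           # navigate and wait for the load event
    html = await page.content()    # fully rendered HTML
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(get_html("https://example.com"))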

Upvotes: 3
