Rapid1898

Reputation: 1220

Scraping (BeautifulSoup, Selenium) not possible for all DIVs?

I am trying to scrape some information from a website. For most of the divs this works fine, but I have a problem reading some specific ones. At first I tried it with a plain requests/bs4 request, then also with Selenium, but I still get no data back.

Below you can find my full code. It works fine with a response with this search:

tmpDiv = soup.find ("div", {"id": "financial-strength"})

But it is not working with this div:

tmpDiv = soup.find ("div", {"id": "analyst-estimate"})

It outputs only

<div class="children" data-v-39722e0c="" id="analyst-estimate" style="min-height:200px;display:block;">
</div>

Here is the full (not working) code:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import sys, os
from selenium.webdriver.chrome.options import Options

link = "https://www.gurufocus.com/stock/AAPL/summary"
path = os.path.abspath (os.path.dirname (sys.argv[0]))
options = Options ()
options.add_argument ('--headless')
options.add_experimental_option ('excludeSwitches', ['enable-logging'])
cd = '/chromedriver.exe'
driver = webdriver.Chrome (path + cd, options=options)
driver.get (link)
soup = BeautifulSoup (driver.page_source, 'html.parser')
time.sleep (2)

page = requests.get (link)
soup = BeautifulSoup (page.content, 'html.parser')
# tmpDiv = soup.find ("div", {"id": "financial-strength"})
tmpDiv = soup.find ("div", {"id": "analyst-estimate"})
print(tmpDiv.prettify())

I heard this is probably a "lazy loading" website, but shouldn't Selenium wait until the full site is loaded with all its content?

Upvotes: 0

Views: 82

Answers (1)

HedgeHog

Reputation: 25196

What happens?

There are two main reasons why you won't get the result:

  1. After requesting the website with selenium, you request it again with requests and assign that response to soup, which overwrites the selenium result.

  2. The data is not loaded until it is needed, which is what you already figured out: it is a "lazy loading" website.
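
To make the second point concrete, here is a minimal sketch of how an unloaded placeholder differs from a loaded section once it is parsed. The two HTML strings are shortened, hand-written stand-ins modeled on the outputs in this thread, and `has_loaded_content` is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup

# Shortened stand-ins (assumed) for the page source before and after
# the lazy-loaded section has been filled in.
PLACEHOLDER_HTML = (
    '<div class="children" id="analyst-estimate" '
    'style="min-height:200px;display:block;"></div>'
)
LOADED_HTML = (
    '<div class="children" id="analyst-estimate">'
    '<div class="capture-area"><h2>Analyst Estimate</h2></div>'
    '</div>'
)


def has_loaded_content(html, div_id):
    """Return True if the div with the given id contains at least one child tag."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": div_id})
    # find(True) returns the first descendant tag, or None if there is none
    return div is not None and div.find(True) is not None


print(has_loaded_content(PLACEHOLDER_HTML, "analyst-estimate"))  # False
print(has_loaded_content(LOADED_HTML, "analyst-estimate"))       # True
```

A check like this can tell you whether the page source was captured too early, before the section was filled.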

How to fix that?

  1. Remove all requests-specific lines.

  2. Scroll the element you need into view, so that its data gets loaded:

    # Selenium 4 syntax; needs: from selenium.webdriver.common.by import By
    element = driver.find_element(By.ID, "analyst-estimate")
    driver.execute_script("arguments[0].scrollIntoView();", element)
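
The scroll only triggers the loading; the data arrives a little later. As an alternative to a fixed pause, the sketch below polls until a table shows up inside the section or a timeout expires. `scroll_and_wait` is a hypothetical helper, the `table` selector assumes the section renders a table once loaded, and the timeout values are arbitrary. It only calls `find_element`, `find_elements` and `execute_script`, so any Selenium driver should work; the locator strings "id" and "css selector" are the values behind `By.ID` and `By.CSS_SELECTOR`. Selenium's own `WebDriverWait` covers the same need.

```python
import time


def scroll_and_wait(driver, element_id, timeout=10.0, poll=0.5):
    """Scroll the element into view, then poll until it contains a <table>.

    Returns the element, or raises TimeoutError if no table appears in time.
    """
    # "id" and "css selector" are the string values of By.ID / By.CSS_SELECTOR
    element = driver.find_element("id", element_id)
    driver.execute_script("arguments[0].scrollIntoView();", element)

    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list while nothing matches
        if element.find_elements("css selector", "table"):
            return element
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no table inside #{element_id} after {timeout}s")
        time.sleep(poll)
```

With a real driver this would be called as `scroll_and_wait(driver, "analyst-estimate")` right before passing `driver.page_source` to BeautifulSoup.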
    

Example

Note that the webdriver path below is mine, so you have to adjust it.

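import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

link = "https://www.gurufocus.com/stock/AAPL/summary"
# Selenium 4 syntax; the raw string keeps the backslashes in the Windows path literal
driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
driver.get(link)

# Give the page time to render its initial content
time.sleep(2.5)

# Scroll the lazy-loaded section into view so that its data gets requested
element = driver.find_element(By.ID, "analyst-estimate")
driver.execute_script("arguments[0].scrollIntoView();", element)

# Give the section time to fill before reading the page source
time.sleep(1)

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# tmpDiv = soup.find("div", {"id": "financial-strength"})
tmpDiv = soup.find("div", {"id": "analyst-estimate"})
print(tmpDiv.prettify())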

Output

<div class="children" data-v-39722e0c="" id="analyst-estimate" style="">
 <div class="capture-area">
  <h2 class="fs-large fc-primary fw-bolder">
   Analyst Estimate
  </h2>
  <table class="normal-table-mobile financial-strength-table">
   <tbody>
    <tr>
     <td>
     </td>
     <td>
      Sep 2021
     </td>
     <td>
      Sep 2022
     </td>
     <td>
      Sep 2023
     </td>
    </tr>
    <tr>
     <td>
      Revenue (Mil $)
     </td>
     <td>
      <span>
       313003.40
      </span>
     </td>
     <td>
      <span>
       328872.10
      </span>
     </td>
     <td>
      <span>
       341577.60
      </span>
     </td>
    </tr>
    <tr>
     <td>
      EBIT (Mil $)
     </td>
     <td>
      <span>
       76803.87
      </span>
     </td>
     <td>
      <span>
       81038.89
      </span>
     </td>
     <td>
      <span>
       84830.53
      </span>
     </td>
    </tr>
    <tr>
     <td>
      EBITDA (Mil $)
     </td>
     <td>
      <span>
       88706.60
      </span>
     </td>
     <td>
      <span>
       92604.88
      </span>
     </td>
     <td>
      <span>
       94034.53
      </span>
     </td>
    </tr>
    <tr>
     <td>
      EPS ($)
     </td>
     <td>
      <span>
       3.94
      </span>
     </td>
     <td>
      <span>
       4.28
      </span>
     </td>
     <td>
      <span>
       4.55
      </span>
     </td>
    </tr>
    <tr>
     <td>
      EPS without NRI ($)
     </td>
     <td>
      <span>
       3.97
      </span>
     </td>
     <td>
      <span>
       4.27
      </span>
     </td>
     <td>
      <span>
       4.55
      </span>
     </td>
    </tr>
    <tr>
     <td>
      EPS Growth Rate (%)
     </td>
     <td>
      <span>
       10.04
      </span>
     </td>
     <td>
      <!-- -->
     </td>
     <td>
      <!-- -->
     </td>
    </tr>
    <tr>
     <td>
      Dividends per Share ($)
     </td>
     <td>
      <span>
       0.74
      </span>
     </td>
     <td>
      <span>
       0.82
      </span>
     </td>
     <td>
      <span>
       1.15
      </span>
     </td>
    </tr>
   </tbody>
  </table>
 </div>
</div>

Upvotes: 1
