JPWilson

Reputation: 759

Scraping using selenium

Hi, I am trying to scrape this website. I was originally using bs4, and that was fine for getting certain elements (sector, name, etc.), but I am not able to use it to get the financial data. Below I have copied some of the page_source; the "-" should in this case be 0.0663. I believe the value is filled in by JavaScript. I have looked around, and none of the solutions I have seen have worked for me. I was wondering if someone could help me crack this.

Although I would be grateful if someone could post some working code, I would also really appreciate being pointed in the right direction, so I can understand what to look for in the HTML that tells me what I need to do and how to get the data.

URL: https://www.tradingview.com/symbols/LSE-TSCO/

HTML:

<span class="tv-widget-fundamentals__label apply-overflow-tooltip">
    Return on Equity (TTM)
</span>
<span class="tv-widget-fundamentals__value apply-overflow-tooltip">
    —
</span>

Python Code:

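from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

url = "https://www.tradingview.com/symbols/LSE-TSCO/"
options = webdriver.ChromeOptions()
options.add_argument('headless')
# Pass the options to the driver; otherwise the headless flag is never applied
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get(url)
html = driver.page_source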

Upvotes: 0

Views: 586

Answers (3)

Nic Laforge

Reputation: 1876

The issue here is not whether the element is present, but the time the page takes to load. The page looks very heavy with all those dynamic graphs. Even before the page is fully loaded, the DOM starts to be created and default values are put in place.

WebDriverWait with find_element_* works when the element is not yet present but will appear after a certain time. In your case, it is present from the start, so adding the wait won't do much. This is also why you get '-': the element is present with its default value.

To fix this, or at least reduce the issue, you can add code that waits until the document readyState is 'complete'.

Something like this can be used:

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_page_ready_state(driver):
    # Poll for up to 20 seconds
    wait = WebDriverWait(driver, 20)

    def _ready_state_script(driver):
        # Report document.readyState back through the async-script callback
        return driver.execute_async_script(
                """
                var callback = arguments[arguments.length - 1];
                callback(document.readyState);
                """) == 'complete'
    wait.until(_ready_state_script)

wait_for_page_ready_state(driver)

Then, since you brought bs4 into play, this is where I would use it:

import re
from bs4 import BeautifulSoup

financials = {}
for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
    try:
        # Collapse runs of whitespace inside the label and value text
        key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                        "apply-overflow-tooltip"}).text.strip())
        value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
        financials[key] = value
    except AttributeError:
        # el.find() returned None: this row has no label or value span
        pass

This will give you every value you need from the financial card.

You can now print the value you need:

print(financials['Return on Equity (TTM)'])

Output:

0.0663

Of course, you can do the above with selenium as well, but I wanted to build on what you started with.

Note that this is not guaranteed to always return the proper value. It might, and it did in my case, but since you know the default value, you could add a while loop that runs until the default changes.
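For a single value, one way to sketch "wait until the default changes" is a custom wait condition that can be handed to WebDriverWait. This is only an illustration, not tested against the live page: the class name text_differs_from is made up here, and the placeholder is assumed to be the dash character shown in the question's HTML. It relies only on driver.find_element(by, value) and on WebDriverWait.until() retrying while the condition returns a falsy value.

```python
class text_differs_from:
    """Hypothetical wait condition: truthy once the element's text is no
    longer empty and no longer equal to the placeholder."""

    def __init__(self, locator, placeholder="\u2014"):
        # locator is a (by, value) tuple, e.g. (By.XPATH, "//span[...]")
        # "\u2014" is the dash the page shows before the data arrives (assumed)
        self.locator = locator
        self.placeholder = placeholder

    def __call__(self, driver):
        text = driver.find_element(*self.locator).text.strip()
        if text and text != self.placeholder:
            return text  # truthy: WebDriverWait.until() returns this value
        return False     # falsy: WebDriverWait.until() keeps polling


# Possible usage with Selenium (not executed here):
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.ui import WebDriverWait
#   locator = (By.XPATH, "//span[contains(.,'Return on Equity (TTM)')]"
#                        "/following-sibling::span[1]")
#   value = WebDriverWait(driver, 20).until(text_differs_from(locator))
```

Because WebDriverWait raises TimeoutException when the time runs out, this version cannot loop forever if the value never loads.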

[EDIT] After running my code in a loop, I hit the default value about one time in five. One way to work around it is to wrap the parsing in a function and loop until a threshold is reached. In my tests, about 90% of the values are normally updated with digits, and whenever the lookup failed with the default value, all the other values were at '-' too. So you can pick a threshold (e.g. 50%) and only return the values once more than that share has been updated.

def get_financial_card_values(default_value='—', threshold=.5):
    # Relies on the module-level `driver` created earlier
    financials = {}
    while True:
        for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
            try:
                key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                                "apply-overflow-tooltip"}).text.strip())
                value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())

                financials[key] = value
            except AttributeError:
                pass
        updated_values = [value for value in financials.values() if value != default_value]
        # Skip the ratio while nothing has been parsed yet (avoids ZeroDivisionError)
        if financials and len(updated_values) / len(financials) > threshold:
            return financials

With this method, I was always able to retrieve the value you are expecting. Note that if the values never change (a site issue), you will loop forever, so you might want to use a timer instead of while True. I just want to point this out; I don't think it will happen.
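A minimal sketch of the timer idea, under some assumptions: poll_until_loaded and its fetch_values argument are hypothetical names, where fetch_values stands for any zero-argument function that returns the label-to-value dict (for example, a wrapper around the BeautifulSoup loop above). The timeout and interval defaults are arbitrary.

```python
import time


def poll_until_loaded(fetch_values, default_value="\u2014", threshold=0.5,
                      timeout=30.0, interval=0.5):
    """Call fetch_values() until more than `threshold` of its values differ
    from default_value; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        values = fetch_values()
        if values:  # an empty dict means nothing was parsed yet
            updated = sum(1 for v in values.values() if v != default_value)
            if updated / len(values) > threshold:
                return values
        if time.monotonic() >= deadline:
            raise TimeoutError("financial values were still at their default")
        time.sleep(interval)  # brief pause instead of re-parsing in a tight loop
```

The caller can then catch TimeoutError and decide whether to reload the page or give up, instead of hanging when the site never fills in the values.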

Upvotes: 0

KunduK

Reputation: 33384

To get the equity value, induce WebDriverWait() and wait for visibility_of_element_located() with the XPath below.

driver.get(url)
print(WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,"//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]"))).text)

You need to import the libraries below.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Upvotes: 2

sleep

Reputation: 104

You can get the return on equity using an absolute XPath:

from selenium.webdriver.common.by import By
equity = driver.find_element(By.XPATH, '/html/body/div[2]/div[4]/div/div/div/div/div/div[2]/div[2]/div[2]/div/div[2]/div[1]/div/div/div[1]/div[3]/div[3]/span[2]').text
print(equity)

Upvotes: 1
