Reputation: 759
Hi, I am trying to scrape this website. I was originally using bs4, which worked fine for certain elements (sector, name, etc.), but I cannot use it to get the financial data. Below I have copied part of the page_source; the "—" should in this case be 0.0663. I believe the value is filled in by JavaScript, and none of the solutions I have found have worked for me. Could someone help me crack this?
Working code would be great, but I would also really appreciate a pointer in the right direction: what to look for in the HTML that tells me what I need to do and how to get the data.
URL: https://www.tradingview.com/symbols/LSE-TSCO/
HTML:
<span class="tv-widget-fundamentals__label apply-overflow-tooltip">
Return on Equity (TTM)
</span>
<span class="tv-widget-fundamentals__value apply-overflow-tooltip">
—
</span>
Python Code:
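from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

url = "https://www.tradingview.com/symbols/LSE-TSCO/"
options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Pass options to the driver, otherwise the headless flag is ignored.
# (Selenium 3 call style; Selenium 4 takes the path via service=Service(...).)
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get(url)
html = driver.page_source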
Upvotes: 0
Views: 586
Reputation: 1876
The issue here is not whether the element is present, but how long the page takes to load. The page is very heavy, with all those dynamic graphs. Even before it has fully loaded, the DOM is already being built and the default values are in place.
WebDriverWait with find_element_* works when the element is not yet present but will appear after some time. In your case the element is present from the start, so adding such a wait won't do much. This is also why you get '—': the element is present with its default value.
To fix this, or at least reduce the issue, you can wait until the document's readyState is 'complete'.
Something like this can be used:
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_page_ready_state(driver):
    # Poll for up to 20 seconds until the browser reports the document as loaded
    wait = WebDriverWait(driver, 20)

    def _ready_state_script(driver):
        return driver.execute_async_script(
            """
            var callback = arguments[arguments.length - 1];
            callback(document.readyState);
            """) == 'complete'

    wait.until(_ready_state_script)

wait_for_page_ready_state(driver)
Then, since you brought bs4 into play, this is where I would use it:
import re
from bs4 import BeautifulSoup

financials = {}
soup = BeautifulSoup(driver.page_source, "lxml")
for el in soup.find_all('div', {"class": "tv-widget-fundamentals__row"}):
    try:
        # Collapse runs of whitespace so the keys and values are clean one-line strings
        key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                            "apply-overflow-tooltip"}).text.strip())
        value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
        financials[key] = value
    except AttributeError:
        # el.find() returned None: this row has no label or value span
        pass
This will give you every value you need from the financial card.
You can now print the value you need:
print(financials['Return on Equity (TTM)'])
Output:
0.0663
Of course, you can do the above with Selenium as well, but I wanted to build on what you started with.
Note that this is not guaranteed to always return the proper value. It did in my case, but since you know the default value, you could add a while loop that runs until the default changes.
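The "loop until the default changes" idea can also be written as a custom wait condition. The sketch below is only an illustration under assumptions: `text_is_not_placeholder` is a hypothetical helper name, and it relies on the fact that `WebDriverWait.until` keeps calling a callable with the driver until it returns something truthy.

```python
PLACEHOLDER = "—"  # default text the page shows before the real value arrives


class text_is_not_placeholder:
    """Hypothetical wait condition: truthy once the element holds a real value.

    `locator` is a (strategy, value) tuple such as ("xpath", "//span[...]").
    """

    def __init__(self, locator, placeholder=PLACEHOLDER):
        self.locator = locator
        self.placeholder = placeholder

    def __call__(self, driver):
        text = driver.find_element(*self.locator).text.strip()
        # Returning False tells the waiter to poll again.
        if not text or text == self.placeholder:
            return False
        return text
```

With Selenium it would be used as `WebDriverWait(driver, 20).until(text_is_not_placeholder((By.XPATH, "//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]")))`, which returns the text once it is no longer the placeholder, or raises `TimeoutException` after 20 seconds.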
[EDIT] After running my code in a loop, I was hitting the default value about 1 time in 5. One workaround is to wrap the scraping in a method and loop until a threshold is reached. In my testing, a successful load had ~90% of the values populated with digits, while a failed load left all values at '—'. So you can use a threshold (e.g. 50%) and only return the values once it is reached.
def get_financial_card_values(default_value='—', threshold=.5):
    financials = {}
    while True:
        # Re-parse the live page on every pass (uses the global driver)
        soup = BeautifulSoup(driver.page_source, "lxml")
        for el in soup.find_all('div', {"class": "tv-widget-fundamentals__row"}):
            try:
                key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                                    "apply-overflow-tooltip"}).text.strip())
                value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
                financials[key] = value
            except AttributeError:
                pass
        updated_values = [value for value in financials.values() if value != default_value]
        # Skip the check while the card is still empty to avoid dividing by zero
        if financials and len(updated_values) / len(financials) > threshold:
            return financials
With this method, I was always able to retrieve the value you are expecting. Note that if the values never change (a site issue), you will loop forever, so you might want to use a timer instead of while True. I just want to point this out; I don't think it will happen.
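As a rough sketch of the timer suggestion, the loop can be bounded by a deadline. The names `updated_fraction`, `poll_financials` and `read_card` are hypothetical; `read_card` stands for any zero-argument function that returns the parsed dict, for example one that runs the BeautifulSoup parsing on `driver.page_source`.

```python
import time


def updated_fraction(financials, default_value="—"):
    """Share of entries whose value is no longer the placeholder."""
    if not financials:
        return 0.0
    updated = [v for v in financials.values() if v != default_value]
    return len(updated) / len(financials)


def poll_financials(read_card, threshold=0.5, timeout=30.0, interval=0.5,
                    default_value="—"):
    """Call read_card() until enough values are filled in or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        financials = read_card()
        if updated_fraction(financials, default_value) > threshold:
            return financials
        if time.monotonic() >= deadline:
            raise TimeoutError(f"values still at default after {timeout} seconds")
        # Short pause so the page is not re-parsed in a tight loop
        time.sleep(interval)
```

Raising `TimeoutError` instead of looping forever lets the caller decide whether to reload the page or give up; the 30-second default is an arbitrary choice for illustration.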
Upvotes: 0
Reputation: 33384
To get the equity value, induce WebDriverWait() and wait for visibility_of_element_located() with the XPath below.
driver.get(url)
print(WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,"//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]"))).text)
You need the following imports.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
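To read other rows of the card the same way, the locator can be built from the label text. A small sketch, where `value_xpath` is a hypothetical helper name and the markup is assumed to match the snippet in the question (the label span immediately followed by the value span):

```python
def value_xpath(label):
    """Build the XPath for the value span that follows a given label span.

    Hypothetical helper; assumes the label contains no single quote,
    since the label is embedded in a single-quoted XPath string.
    """
    if "'" in label:
        raise ValueError("label must not contain a single quote")
    return f"//span[contains(.,'{label}')]/following-sibling::span[1]"
```

For example, `value_xpath('Return on Equity (TTM)')` produces the same XPath used above, and the result can be passed to `EC.visibility_of_element_located((By.XPATH, ...))`.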
Upvotes: 2