Reputation: 53
I'm trying to scrape some data from here: https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly.
I'd like to get the dates in the first row (i.e. 31-Mar-21 31-Dec-20 30-Sep-20 30-Jun-20 31-Mar-20).
The problem is that when I try to get the date with bs4, the output is empty. I wrote this code:
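import requests
from bs4 import BeautifulSoup
url = "https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
# Locate the container that holds the financial tables
a = soup.find("div", attrs={"class": "tables-container"})
# Take the text of the first <time> element inside it
date = a.find("time").text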
When I run it, I get nothing. Printing a shows that the time tag is empty, so find() has no date to return:
<th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>
Thanks.
Upvotes: 1
Views: 141
Reputation: 195418
The data is embedded in the page as JSON. Here is an example of how to parse it:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
x = data["props"]["initialState"]["markets"]["financials"]["financial_tables"]
headers = x["income_interim_tables"][0]["headers"]
print(*headers, sep="\n")
Prints:
2021-03-31
2020-12-31
2020-09-30
2020-06-30
2020-03-31
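If the dates are needed in the format the page displays (e.g. 31-Mar-21), the ISO strings can be reformatted with the standard library. This is only a minimal sketch: it assumes the headers are plain YYYY-MM-DD strings as in the output above, the headers list is hard-coded here for illustration, and to_site_format is a hypothetical helper name.

```python
from datetime import datetime

# Sample values copied from the output above; in the real script this
# list would come from the parsed JSON instead of being hard-coded.
headers = ["2021-03-31", "2020-12-31", "2020-09-30", "2020-06-30", "2020-03-31"]


def to_site_format(iso_date):
    """Convert 'YYYY-MM-DD' to 'DD-Mon-YY' (hypothetical helper)."""
    # %b is locale-dependent; with the default C/English locale
    # it yields abbreviated English month names.
    return datetime.strptime(iso_date, "%Y-%m-%d").strftime("%d-%b-%y")


formatted = [to_site_format(d) for d in headers]
print(*formatted, sep="\n")
```

Note that %b follows the current locale, so the month abbreviations may differ if the script changes the locale.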
Upvotes: 3
Reputation: 553
As I do not have enough reputation to comment:
The problem is that the scraped HTML does not contain the dates: the time tags are empty.
You need to render the JavaScript that fills in the dates before scraping. That is a separate topic and requires a headless browser or a similar approach, e.g. https://www.scrapingbee.com/blog/scrapy-javascript/
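Before setting up a headless browser, it may be worth checking whether the raw HTML already carries the data in an embedded script tag, which is common on Next.js sites. Here is a rough sketch using only the standard library. SAMPLE_HTML is a made-up, heavily simplified stand-in for a real page (the real JSON is nested differently), and extract_next_data is a hypothetical helper.

```python
import json
import re

# Made-up sample: an empty <time> tag plus a JSON payload in a script tag.
SAMPLE_HTML = """
<html><body>
<th scope="column"><time class="TextLabel"></time></th>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"headers": ["2021-03-31", "2020-12-31"]}}
</script>
</body></html>
"""


def extract_next_data(html):
    """Return the parsed JSON from the __NEXT_DATA__ script tag, or None."""
    match = re.search(
        r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_next_data(SAMPLE_HTML)
if data is None:
    print("No embedded JSON found; a headless browser is probably needed.")
else:
    print(data["props"]["headers"])
```

If the helper returns None for the real page, the content is most likely filled in by a later request, and rendering the page is the safer route.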
Upvotes: 0