Reputation: 13
I'm trying to get the table at this URL: https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2 . I tried reading it with requests and BeautifulSoup:
from bs4 import BeautifulSoup as bs
import requests
s = requests.session()
req = s.get('https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2', headers={
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/51.0.2704.103 Safari/537.36"})
soup = bs(req.content, 'html.parser')
table = soup.find('table')
However, I only get the headers of the table; the tbody comes back empty:
<table class="table">
<caption class="pl8">Ricoverati e posti letto in area non critica e terapia intensiva.</caption>
<thead>
<tr>
<th class="cella-tabella-sm align-middle text-center" scope="col">Regioni</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Area Non Critica</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Area Non Critica</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Terapia intensiva</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Terapia Intensiva</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL Terapia Intensiva attivabili</th>
</tr>
</thead>
<tbody id="tab2_body">
</tbody>
</table>
So I tried the URL where I think the table data is actually loaded from: https://Agenas:[email protected]/covid19/web/index.php?r=json%2Ftab2 . But in this case I always get a 401 status code, even after adding the username and password to the headers as in the previous request. For example:
requests.get('https://Agenas:[email protected]/covid19/web/index.php?r=json%2Ftab2', headers={
    'username': 'Agenas',
    'password': 'tab2-19',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'})
Any idea on how to solve this? Thank you.
Upvotes: 1
Views: 119
Reputation: 20042
Those "secrets" needed for the headers are actually embedded in a <script> tag, so you can fish them out, parse them as JSON, and use them in the request headers.
Here's how:
import json
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/89.0.4389.90 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}
with requests.Session() as s:
    end_point = "https://Agenas:[email protected]/covid19/web/index.php?r=json%2Ftab2"
    regular_page = "https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2"
    # Fetch the regular page and grab the last <script> tag,
    # which contains the AJAX call that populates the table
    html = s.get(regular_page, headers=headers).text
    script = BeautifulSoup(html, "html.parser").find_all("script")[-1].string
    # Pull the headers object out of the script and parse it as JSON
    hacked_payload = json.loads(
        re.search(r"headers:\s({.*}),", script, re.S).group(1).strip()
    )
    headers.update(hacked_payload)
    # With the fished-out headers in place, the JSON endpoint responds
    print(json.dumps(s.get(end_point, headers=headers).json(), indent=2))
Output:
[
{
"regione": "Abruzzo",
"dato1": "667",
"dato2": "1495",
"dato3": "89",
"dato4": "215",
"dato5": "0"
},
{
"regione": "Basilicata",
"dato1": "164",
"dato2": "426",
"dato3": "12",
"dato4": "88",
"dato5": "13"
},
and so on ...
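If you want the result as an actual table rather than raw JSON, the list of records above drops straight into a pandas DataFrame. Here is a minimal sketch using the two sample records shown; it assumes pandas is installed and that dato1 through dato5 follow the same order as the column headers of the page's table:

```python
import pandas as pd

# Sample records in the shape returned by the JSON endpoint
records = [
    {"regione": "Abruzzo", "dato1": "667", "dato2": "1495",
     "dato3": "89", "dato4": "215", "dato5": "0"},
    {"regione": "Basilicata", "dato1": "164", "dato2": "426",
     "dato3": "12", "dato4": "88", "dato5": "13"},
]

df = pd.DataFrame(records)

# Map dato1..dato5 to the table headers (assumed to match the page order)
df = df.rename(columns={
    "regione": "Regioni",
    "dato1": "Ricoverati in Area Non Critica",
    "dato2": "PL in Area Non Critica",
    "dato3": "Ricoverati in Terapia intensiva",
    "dato4": "PL in Terapia Intensiva",
    "dato5": "PL Terapia Intensiva attivabili",
})

# The values come back as strings, so cast the numeric columns to int
numeric_cols = df.columns.drop("Regioni")
df[numeric_cols] = df[numeric_cols].astype(int)
print(df)
```

In the real script you would replace `records` with the parsed response, e.g. `s.get(end_point, headers=headers).json()`.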
Upvotes: 2