Reputation: 49
I am trying to access the rows of the table at https://www.parliament.gov.za/hansard?sorts[date]=-1. I ultimately want to download the PDFs contained in each row, but I am having a hard time accessing the rows of the table. When I inspect a table row element, I see that it is under the <tbody> tag. However, I can't seem to access this data using BeautifulSoup. I have done a decent amount of web scraping, but this is the first time I've run into this issue. This is the code that I currently have:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.parliament.gov.za/hansard?sorts[date]=-1'
request = requests.get(url)
soup = bs(request.text, 'html.parser')
# Take the first <table> element in the downloaded HTML
table1 = soup.find_all('table')[0]
print(table1)
Output:
<table id="papers-table">
<thead>
<th>Name</th>
<th>House</th>
<th>Language</th>
<th>Date</th>
<th>Type</th>
<th data-dynatable-column="file_location" style="display:none">File Location</th>
</thead>
<tbody>
</tbody>
</table>
Clearly, there is nothing in the <tbody> tag even though this is where I believe the row data should be. In general, whenever I try to find the tr tags, which is where Chrome says the row data is stored, I can't find any of the ones with the PDFs. I am fairly certain that the issue has something to do with the fact that the source code is missing this data as well, but I have no idea how to find it. Since it's on the website, I assume that there must be a way, right? Thanks!
Upvotes: 1
Views: 538
Reputation: 20128
The rows are loaded dynamically by JavaScript after the page loads, so the HTML that requests receives does not contain them. However, the data is available by sending a GET request to the website's API:
https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0
There's no need for BeautifulSoup; the requests library alone is enough:
import requests

URL = "https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0"
BASE_URL = "https://www.parliament.gov.za/storage/app/media/Docs/"

# The API returns JSON; each entry in "records" describes one document
response = requests.get(URL).json()
for data in response["records"]:
    # "file_location" is a path relative to BASE_URL
    print(BASE_URL + data["file_location"])
Output:
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3a888bc6-ffc7-46a1-9803-ffc148b07bfc.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3eb3103c-2d3c-418f-bb24-494b17bdeb22.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/bf0afdf8-352c-4dde-a380-11ce0a038dad.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/285e1633-aaeb-4a0d-bd54-98a4d5ec5127.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/966926ce-4cfe-4f68-b4a1-f99a09433137.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/d4bdb2c2-e8c8-461f-bc0b-9ffff3403be3.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/daecc145-bb44-47f1-a3b2-9400437f71d8.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/4f204d7e-0a25-4b64-b5a7-46c8730abe91.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/f2863e16-b448-46e3-939d-e14859984513.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/cd30e289-2ff2-47f5-b2a7-77e496e52f3a.pdf
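Since the goal in the question is to download the PDFs rather than just list them, the snippet above could be extended roughly as follows. This is only a sketch under a few assumptions: the helper names (page_url, pdf_url, download_pdfs) and the output directory are made up for illustration, and the rule offset = (page - 1) * perPage is inferred from the single example URL above, not from any documentation of the API.

```python
from pathlib import Path
from urllib.parse import urlencode

import requests

API_URL = "https://www.parliament.gov.za/docsjson"
BASE_URL = "https://www.parliament.gov.za/storage/app/media/Docs/"


def page_url(page, per_page=10):
    """Build the API URL for one page of Hansard records.

    Assumption: offset = (page - 1) * per_page, inferred from the
    page=1, perPage=10, offset=0 example URL.
    """
    params = {
        "queries[type]": "hansard",
        "sorts[date]": -1,
        "page": page,
        "perPage": per_page,
        "offset": (page - 1) * per_page,
    }
    return API_URL + "?" + urlencode(params)


def pdf_url(record):
    """Turn one entry of the "records" list into a full PDF URL."""
    return BASE_URL + record["file_location"]


def download_pdfs(out_dir="hansard_pdfs", max_pages=1):
    """Download the PDFs from the first max_pages pages into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for page in range(1, max_pages + 1):
        response = requests.get(page_url(page), timeout=30)
        response.raise_for_status()
        records = response.json()["records"]
        if not records:
            break  # no more results
        for record in records:
            # Name the local file after the last path component
            target = out / Path(record["file_location"]).name
            pdf = requests.get(pdf_url(record), timeout=60)
            pdf.raise_for_status()
            target.write_bytes(pdf.content)
            saved.append(target)
    return saved
```

Calling download_pdfs(max_pages=1) should save the ten files listed in the output above into a hansard_pdfs folder; raising max_pages walks further back through the archive, provided the paging assumption holds.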
Upvotes: 2
Reputation: 81
You're trying to scrape dynamic content, but requests only loads the static HTML of the page, before any scripts have run. Look for a way to get the dynamically generated HTML, or use Selenium to render the page in a real browser.
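If you go the Selenium route, something along these lines might work. Treat it as a rough sketch: the function names are invented for illustration, it assumes Chrome and a matching driver are available, and the "#papers-table tbody tr" selector is taken from the table id in the question's output rather than verified against the live page.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_id="papers-table"):
    """Return the cell texts of every body row in the given table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in table.select("tbody tr")
    ]


def rendered_html(url, timeout=15):
    """Load url in Chrome and return the HTML once table rows exist."""
    # Imported here so table_rows() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until the script has inserted at least one row
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#papers-table tbody tr")
            )
        )
        return driver.page_source
    finally:
        driver.quit()
```

With that, table_rows(rendered_html('https://www.parliament.gov.za/hansard?sorts[date]=-1')) would parse the rendered page instead of the static source. On the static HTML from the question, table_rows returns an empty list, which is exactly the symptom described there.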
Upvotes: 0