Grant

Reputation: 49

Can't find <tr> data with BeautifulSoup

I am trying to access rows of the table from https://www.parliament.gov.za/hansard?sorts[date]=-1. I ultimately want to download the PDFs contained in each row, but I am having a hard time accessing the rows of the table. When I inspect a table row element, I see that it is under the <tbody> tag. However, I can't seem to access this data using BeautifulSoup. I have done a decent amount of web scraping, but this is the first time I've run into this issue. This is the code that I currently have:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.parliament.gov.za/hansard?sorts[date]=-1'

request = requests.get(url)

soup = bs(request.text, 'html.parser')

table1 = soup.findAll('table')[0]

print(table1)

Output:

<table id="papers-table">
  <thead>
    <th>Name</th>
    <th>House</th>
    <th>Language</th>
    <th>Date</th>
    <th>Type</th>
    <th data-dynatable-column="file_location" style="display:none">File Location</th>
  </thead>
  <tbody>
  </tbody>
</table>

Clearly, there is nothing in the <tbody> tag even though this is where I believe the row data should be. In general, whenever I try to find the tr tags, which is where Chrome says the row data is stored, I can't find any of the ones with the PDFs. I am fairly certain that the issue has something to do with the fact that the source code is missing this data as well, but I have no idea how to find it. Since it's on the website, I assume that there must be a way, right? Thanks!

Upvotes: 1

Views: 538

Answers (2)

MendelG

Reputation: 20128

The table rows are loaded dynamically with JavaScript, so the HTML that requests downloads never contains them. However, the data is available by sending a GET request to the website's API:

https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0

There's no need to use BeautifulSoup; the requests library alone is enough:

import requests


URL = "https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0"
BASE_URL = "https://www.parliament.gov.za/storage/app/media/Docs/"
response = requests.get(URL).json()

for data in response["records"]:
    print(BASE_URL + data["file_location"])

Output:

https://www.parliament.gov.za/storage/app/media/Docs/hansard/3a888bc6-ffc7-46a1-9803-ffc148b07bfc.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3eb3103c-2d3c-418f-bb24-494b17bdeb22.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/bf0afdf8-352c-4dde-a380-11ce0a038dad.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/285e1633-aaeb-4a0d-bd54-98a4d5ec5127.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/966926ce-4cfe-4f68-b4a1-f99a09433137.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/d4bdb2c2-e8c8-461f-bc0b-9ffff3403be3.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/daecc145-bb44-47f1-a3b2-9400437f71d8.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/4f204d7e-0a25-4b64-b5a7-46c8730abe91.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/f2863e16-b448-46e3-939d-e14859984513.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/cd30e289-2ff2-47f5-b2a7-77e496e52f3a.pdf
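
Since the goal is to download the PDFs and not just print their links, here is a minimal sketch of how the same endpoint could be paged through and the files saved. The helper names (page_url, pdf_urls, local_path, download_pages) and the hansard_pdfs folder are made up for illustration. It also assumes that offset is simply (page - 1) * perPage, which matches the URL above for page 1 but is not something I have confirmed for later pages.

```python
import os
from urllib.parse import urlencode, urlsplit

API_URL = "https://www.parliament.gov.za/docsjson"
BASE_URL = "https://www.parliament.gov.za/storage/app/media/Docs/"


def page_url(page, per_page=10):
    """Build the API URL for one page of Hansard records."""
    params = {
        "queries[type]": "hansard",
        "sorts[date]": "-1",
        "page": page,
        "perPage": per_page,
        # Assumption: offset is the number of records before this page.
        "offset": (page - 1) * per_page,
    }
    return API_URL + "?" + urlencode(params)


def pdf_urls(payload):
    """Turn one decoded JSON response into a list of full PDF URLs."""
    return [BASE_URL + record["file_location"] for record in payload["records"]]


def local_path(url, dest_dir):
    """Pick a local file name from the last segment of the PDF URL."""
    return os.path.join(dest_dir, os.path.basename(urlsplit(url).path))


def download_pages(pages, dest_dir="hansard_pdfs"):
    """Download every PDF listed on the given page numbers."""
    # requests is third-party; it is imported here so the helpers above
    # only need the standard library.
    import requests

    os.makedirs(dest_dir, exist_ok=True)
    with requests.Session() as session:
        for page in pages:
            response = session.get(page_url(page), timeout=30)
            response.raise_for_status()
            for url in pdf_urls(response.json()):
                pdf = session.get(url, timeout=60)
                pdf.raise_for_status()
                with open(local_path(url, dest_dir), "wb") as fh:
                    fh.write(pdf.content)
```

Calling download_pages(range(1, 4)) would then try to fetch the first three pages. If the offset assumption turns out to be wrong, only page_url needs to change.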

Upvotes: 2

Ashwin Sankar

Reputation: 81

You're trying to scrape dynamic content, but requests only downloads the static HTML of the page, before any JavaScript has run. Either find the request that supplies the dynamically generated data, or use Selenium to render the page first.
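
For the Selenium route, a rough sketch could look like the following. It assumes Chrome and a matching driver are available, and that the rendered rows end up as td cells under the #papers-table element shown in the question. The function name fetch_rendered_rows and the 15-second timeout are just illustrative choices.

```python
URL = "https://www.parliament.gov.za/hansard?sorts[date]=-1"
# Assumption: JavaScript fills the tbody of the table with id="papers-table".
ROW_SELECTOR = "#papers-table tbody tr"


def fetch_rendered_rows(url=URL, timeout=15):
    """Open the page in a real browser and return the text of each table row."""
    # Selenium is third-party; importing it here keeps the module importable
    # on machines where it is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until at least one row has been inserted by the page's scripts.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ROW_SELECTOR))
        )
        rows = driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR)
        # textContent is read instead of .text so that cells hidden with
        # display:none (such as the File Location column) are not left empty.
        return [
            [cell.get_attribute("textContent") for cell in row.find_elements(By.TAG_NAME, "td")]
            for row in rows
        ]
    finally:
        driver.quit()
```

This is slower than calling the data endpoint directly, since it starts a full browser, but it does not depend on knowing which request the page makes behind the scenes.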

Upvotes: 0
