Reputation: 43
I am trying to scrape information from the "Addresses" tab on the webpage https://training.gov.au/Organisation/Details/90003 using Python. However, even when I target the correct CSS selector or tag, the code only returns null values. Strangely, it works correctly when I target the "Summary" tab, so it seems that the website only returns data for that tab. I have zero coding experience, so I'm unsure whether there are specific considerations I need to keep in mind.
I am attempting to scrape data from the "Addresses" tab on this webpage: https://training.gov.au/Organisation/Details/90003.
I have inspected the webpage and identified the relevant CSS selectors and tags to target for scraping.
I am using Python for web scraping and have tried libraries like Beautiful Soup and Requests.
My code works as expected when I scrape data from the "Summary" tab, but it returns null values when I try to scrape from the "Addresses" tab.
I suspect that there might be some specific JavaScript or dynamic content loading that prevents data from being retrieved from the "Addresses" tab.
I would appreciate any guidance on how to access and scrape data from the "Addresses" tab successfully.
Code Sample:
Here's the version of the code I'm currently using to scrape data from the "Addresses" tab:
import requests
from bs4 import BeautifulSoup

# URL of the webpage
url = 'https://training.gov.au/Organisation/Details/90003'

# Send an HTTP GET request to fetch the webpage
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Use the CSS selector to target the element by id
    # (works for "#rtoDetails-1" but not for other ids such as "#rtoDetails-4")
    target_element = soup.select_one('#rtoDetails-1')

    # Check if the element was found
    if target_element:
        # Extract and print the text content of the element
        print(target_element.text.strip())
    else:
        print("Target element not found.")
else:
    print("Failed to retrieve the webpage. Status code:", response.status_code)
Expected Output:
I expect the element selected by #rtoDetails-4 to contain the information from the "Addresses" tab, but it currently returns nothing.
Additional Information:
Any insights or recommendations on how to handle dynamic content or JavaScript-based loading on webpages would be greatly appreciated. If there are specific steps I need to follow or if I'm missing something crucial, please provide detailed guidance as I'm relatively new to coding. Thank you in advance for your assistance!
Upvotes: 2
Views: 48
Reputation: 195553
The addresses you see on the page are loaded from a separate URL. You can use this example to download the right HTML:
import requests
from bs4 import BeautifulSoup

link = "https://training.gov.au/Organisation/Details/90003"
response = requests.get(link)
soup = BeautifulSoup(response.content, "html.parser")
# Find the link the page uses to load the "Addresses" tab content
link = soup.select_one('[href*="AjaxDetailsLoadAddresses"]')["href"]
# The href is relative, so prepend the site root
link = "https://training.gov.au" + link
# Download and parse the HTML for the tab, then print its text
soup = BeautifulSoup(requests.get(link).content, "html.parser")
print(soup.get_text(strip=True, separator=" "))
Prints:
...
Job title: Chief Executive Officer Organisation name: Technical and Further Education Commission Phone: (02) 7920
...
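If you want to reuse this for other pages or tabs, the same two-step idea (find the tab's link in the main page, then request that link) can be wrapped in small functions. This is only a sketch: the names `find_tab_url` and `fetch_tab_text`, the `timeout` value, and the error handling are my own illustrative choices, and it assumes the main page still contains a link whose `href` includes `AjaxDetailsLoadAddresses`, as in the code above. It uses `urljoin` instead of string concatenation so that relative and absolute hrefs are both handled.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://training.gov.au"


def find_tab_url(html, marker, base_url=BASE_URL):
    """Return the absolute URL of the first link whose href contains
    `marker`, or None if the page has no such link."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.select_one(f'[href*="{marker}"]')
    if anchor is None:
        return None
    # urljoin copes with both relative and already-absolute hrefs
    return urljoin(base_url, anchor["href"])


def fetch_tab_text(page_url, marker="AjaxDetailsLoadAddresses", timeout=10):
    """Download the main page, follow the tab's link, and return the
    tab's text content as a single string."""
    page = requests.get(page_url, timeout=timeout)
    page.raise_for_status()  # fail loudly on HTTP errors

    tab_url = find_tab_url(page.text, marker, base_url=page_url)
    if tab_url is None:
        raise ValueError(f"No link containing {marker!r} found on {page_url}")

    fragment = requests.get(tab_url, timeout=timeout)
    fragment.raise_for_status()
    soup = BeautifulSoup(fragment.text, "html.parser")
    return soup.get_text(strip=True, separator=" ")
```

Calling `fetch_tab_text("https://training.gov.au/Organisation/Details/90003")` should then return the same text as the snippet above, and it raises a clear error instead of a `TypeError` if the link cannot be found. Other tabs may well use a different marker in their link, so check the page source before passing a different `marker`.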
Upvotes: 1