Chuixi

Reputation: 43

Scraping Addresses Tab Data from https://training.gov.au/Organisation/Details/90003 Using Python

I am trying to scrape information from the "Addresses" tab on the webpage https://training.gov.au/Organisation/Details/90003 using Python. However, even when I target the correct CSS selector or tag, the code returns no data. Strangely, it works correctly when I target the "Summary" tab. It seems that the website only returns data for the "Summary" tab. I have zero experience in coding, so I'm unsure if there are specific considerations I need to keep in mind.

  1. I am attempting to scrape data from the "Addresses" tab on this webpage: https://training.gov.au/Organisation/Details/90003.

  2. I have inspected the webpage and identified the relevant CSS selectors and tags to target for scraping.

  3. I am using Python for web scraping and have tried libraries like Beautiful Soup and Requests.

  4. My code works as expected when I scrape data from the "Summary" tab, but it returns null values when I try to scrape from the "Addresses" tab.

  5. I suspect that there might be some specific JavaScript or dynamic content loading that prevents data from being retrieved from the "Addresses" tab.

  6. I would appreciate any guidance on how to access and scrape data from the "Addresses" tab successfully.

Code Sample:

Here's the version of the code I'm currently using to scrape data from the "Addresses" tab:

import requests
from bs4 import BeautifulSoup

# URL of the webpage
url = 'https://training.gov.au/Organisation/Details/90003'

# Send an HTTP GET request to fetch the webpage
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Use a CSS selector to target the tab panel by id.
    # '#rtoDetails-1' ("Summary") works, but '#rtoDetails-4' ("Addresses") returns nothing.
    target_element = soup.select_one('#rtoDetails-1')

    # Check if the element was found
    if target_element:
        # Extract and print the text content of the element
        print(target_element.text.strip())
    else:
        print("Target element not found.")
else:
    print("Failed to retrieve the webpage. Status code:", response.status_code)

Expected Output:

I expect the element with id rtoDetails-4 to contain the information from the "Addresses" tab, but selecting it currently returns nothing.

Additional Information:

Any insights or recommendations on how to handle dynamic content or JavaScript-based loading on webpages would be greatly appreciated. If there are specific steps I need to follow or if I'm missing something crucial, please provide detailed guidance as I'm relatively new to coding. Thank you in advance for your assistance!

Upvotes: 2

Views: 48

Answers (1)

Andrej Kesely

Reputation: 195553

The addresses you see on the page are loaded from a separate URL. You can use this example to download the right HTML:

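import requests
from bs4 import BeautifulSoup

# Fetch the organisation page; the addresses themselves are not in this HTML
link = "https://training.gov.au/Organisation/Details/90003"
response = requests.get(link)
soup = BeautifulSoup(response.content, "html.parser")

# Find the element whose href points at the address endpoint and make it absolute
link = soup.select_one('[href*="AjaxDetailsLoadAddresses"]')["href"]
link = "https://training.gov.au" + link

# Download and parse the HTML fragment that contains the addresses
soup = BeautifulSoup(requests.get(link).content, "html.parser")

print(soup.get_text(strip=True, separator=" "))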

Prints:

...

Job title: Chief Executive Officer Organisation name: Technical and Further Education Commission Phone: (02) 7920 
...
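
If you want something a bit more defensive, the same idea can be split into two small functions. This is only a sketch under a few assumptions: the names `find_tab_url` and `fetch_tab_text` are made up for illustration, the 30-second timeout is arbitrary, and it assumes the page still exposes a link whose href contains "AjaxDetailsLoadAddresses". The first function only parses HTML, so you can try it on a saved copy of the page without hitting the site.

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://training.gov.au"


def find_tab_url(page_html: str, marker: str, base_url: str = BASE_URL) -> Optional[str]:
    """Return the absolute URL of the first element whose href contains
    `marker`, or None if no such element exists (hypothetical helper)."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Same attribute-substring selector as in the answer above;
    # `marker` must not contain double quotes.
    element = soup.select_one(f'[href*="{marker}"]')
    if element is None:
        return None
    # urljoin handles both relative and already-absolute hrefs
    return urljoin(base_url, element["href"])


def fetch_tab_text(page_url: str, marker: str, timeout: float = 30.0) -> Optional[str]:
    """Download the page, follow the link matching `marker`, and return the
    text of the fragment it points to (hypothetical helper)."""
    # Imported here so find_tab_url can be used without the network dependency
    import requests

    page = requests.get(page_url, timeout=timeout)
    page.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page

    tab_url = find_tab_url(page.text, marker)
    if tab_url is None:
        return None

    fragment = requests.get(tab_url, timeout=timeout)
    fragment.raise_for_status()
    soup = BeautifulSoup(fragment.content, "html.parser")
    return soup.get_text(strip=True, separator=" ")
```

Calling `fetch_tab_text("https://training.gov.au/Organisation/Details/90003", "AjaxDetailsLoadAddresses")` should then give the same kind of text as the output above, or `None` if the link is not found, which makes it easier to tell "the page layout changed" apart from "the request failed".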

Upvotes: 1
