z star

Reputation: 712

Scraping table of data from webpage with inconsistently nested html tags

I am trying to scrape data from the tables at https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/. Specifically, I want the 'Metropolitan tram' table. However, the HTML elements are nested inconsistently, and I am unsure how to identify the table by name and extract its content.

This is what I have tried:

import requests
from bs4 import BeautifulSoup

URL = "https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")


tables = soup.find_all("div", class_="mceTmpl table__wrapper")
for table in tables:
    print("NEXT-------------------------------------------")
    print(table, end="\n"*2)

Upvotes: 0

Views: 54

Answers (1)

HedgeHog

Reputation: 25048

You can use pandas.read_html() to scrape tables, which is common practice. It parses the page with lxml or BeautifulSoup under the hood and returns a list of DataFrames, one per table, so you can select your table from that list by index.

Alternatively, use CSS selectors:

soup.select('h3:has(a[name="metrotram"]) + div > div:first-of-type tr')
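To show how that selector picks the table by name, here is a minimal sketch run against a small inline HTML sample. The sample markup is an assumption inferred from the selector (an h3 containing a named anchor, followed by a div that wraps the table), and extract_rows is a hypothetical helper; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup inferred from the selector;
# the real page may nest its elements differently.
SAMPLE_HTML = """
<h3><a name="metrotrain"></a>Metropolitan train</h3>
<div>
  <div class="table__wrapper">
    <table>
      <tr><th></th><th>% timetable delivered</th></tr>
      <tr><td>placeholder date</td><td>placeholder value</td></tr>
    </table>
  </div>
</div>
<h3><a name="metrotram"></a>Metropolitan tram</h3>
<div>
  <div class="table__wrapper">
    <table>
      <tr><th></th><th>% timetable delivered</th><th>% services on-time at timing points</th></tr>
      <tr><td>Sunday, 5 February 2023</td><td>99.4%</td><td>83.3%</td></tr>
      <tr><td>Saturday, 4 February 2023</td><td>99.4%</td><td>81.8%</td></tr>
    </table>
  </div>
</div>
"""

# Selector from the answer: the h3 holding the named anchor, then the
# div right after it, then that div's first child div, then every row.
SELECTOR = 'h3:has(a[name="metrotram"]) + div > div:first-of-type tr'


def extract_rows(html):
    """Return the text of each cell, row by row, for the matched table."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.select(SELECTOR)
    ]


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Anchoring on the heading's a[name="metrotram"] means the lookup does not depend on the table's position on the page, which is the main difference from selecting by index.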

Example

import pandas as pd
import requests

# read_html returns one DataFrame per <table> found in the HTML;
# [1] picks the second one, and header=0 uses its first row as column names.
pd.read_html(
    requests.get(
        'https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/',
        headers={'user-agent':'some agent'}
    ).text,
    header=0
)[1]

Output

                   Unnamed: 0  % timetable delivered  % services on-time at timing points
0     Sunday, 5 February 2023                  99.4%                                83.3%
1   Saturday, 4 February 2023                  99.4%                                81.8%
2     Friday, 3 February 2023                  98.4%                                79.7%
3   Thursday, 2 February 2023                  97.9%                                72.8%
4  Wednesday, 1 February 2023                  98.9%                                79.1%
5    Tuesday, 31 January 2023                  99.0%                                81.4%
6     Monday, 30 January 2023                  99.3%                                90.2%
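The values in this output are still strings. If you need dates and numbers, a possible follow-up looks like the sketch below. It rebuilds two rows of the output by hand so it runs without network access; the tidy helper and the "date" column name are assumptions for illustration, not part of the page or of pandas.

```python
import pandas as pd

# Two rows copied from the output above, rebuilt by hand for the example.
df = pd.DataFrame(
    {
        "Unnamed: 0": ["Sunday, 5 February 2023", "Saturday, 4 February 2023"],
        "% timetable delivered": ["99.4%", "99.4%"],
        "% services on-time at timing points": ["83.3%", "81.8%"],
    }
)


def tidy(frame):
    """Hypothetical helper: name the date column and convert the types."""
    out = frame.rename(columns={"Unnamed: 0": "date"})
    # Dates look like "Sunday, 5 February 2023" (assumes English day/month names).
    out["date"] = pd.to_datetime(out["date"], format="%A, %d %B %Y")
    # Strip the trailing percent sign and store the rest as floats.
    for col in out.columns.drop("date"):
        out[col] = out[col].str.rstrip("%").astype(float)
    return out


clean = tidy(df)
print(clean.dtypes)
```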

Upvotes: 2
