Stackcans

Reputation: 351

Extract webpage elements up to a point

I want to extract elements from a webpage up to a certain point: the line <div class="clear"></div>. This element appears twice on the page I am scraping, so I want to extract all the elements before the first occurrence and then stop.

For example:

import requests
from bs4 import BeautifulSoup

hhref = ['https://www.ukfirestations.co.uk/stations/bedfordshire','https://www.ukfirestations.co.uk/stations/buckinghamshire']

# Placeholder value: the headers used originally were not shown
headers = {"User-Agent": "Mozilla/5.0"}

dats = []
for url in hhref:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    station = soup.find('div',{'id':'stations-grid'}).find_all('a')
    for j in station:
        dats.append(j['href'])

This extracts all the links, including those after <div class="clear"></div>. The webpage splits the stations into 'current' and 'old' sections. I only want those in the 'current' section, but I am unsure how to tell BeautifulSoup to extract elements only up to a certain point.

Upvotes: 1

Views: 54

Answers (1)

Andrej Kesely

Reputation: 195553

You can use a CSS selector with :not to filter out the unwanted stations:

import requests
from bs4 import BeautifulSoup

hhref = [
    "https://www.ukfirestations.co.uk/stations/bedfordshire",
    "https://www.ukfirestations.co.uk/stations/buckinghamshire",
]

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

for link in hhref:
    print(f"{link=}")
    print()
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    stations = soup.select(
        'h2:-soup-contains("Current Stations") ~ .stations-row .station:not(h2:-soup-contains("Old Stations") ~ .stations-row .station)'
    )

    for s in stations:
        print(
            "{:<30} {}".format(s.select_one(".station-name").text, s.a["href"])
        )

    print()

Prints:

link='https://www.ukfirestations.co.uk/stations/bedfordshire'

Ampthill                       https://www.ukfirestations.co.uk/station/ampthill
Bedford                        https://www.ukfirestations.co.uk/station/bedford
Biggleswade                    https://www.ukfirestations.co.uk/station/biggleswade
Dunstable                      https://www.ukfirestations.co.uk/station/dunstable
Harrold                        https://www.ukfirestations.co.uk/station/harrold
Headquarters                   https://www.ukfirestations.co.uk/station/headquarters-10
Kempston                       https://www.ukfirestations.co.uk/station/kempston
Leighton Buzzard               https://www.ukfirestations.co.uk/station/leighton-buzzard
Luton                          https://www.ukfirestations.co.uk/station/luton
Potton                         https://www.ukfirestations.co.uk/station/potton
Sandy                          https://www.ukfirestations.co.uk/station/sandy
Shefford                       https://www.ukfirestations.co.uk/station/shefford
Stopsley                       https://www.ukfirestations.co.uk/station/stopsley
Toddington                     https://www.ukfirestations.co.uk/station/toddington
Woburn                         https://www.ukfirestations.co.uk/station/woburn

link='https://www.ukfirestations.co.uk/stations/buckinghamshire'

Amersham                       https://www.ukfirestations.co.uk/station/amersham
Aylesbury & HQ                 https://www.ukfirestations.co.uk/station/aylesbury-hq
Beaconsfield                   https://www.ukfirestations.co.uk/station/beaconsfield
Brill                          https://www.ukfirestations.co.uk/station/brill
Broughton                      https://www.ukfirestations.co.uk/station/broughton
Buckingham                     https://www.ukfirestations.co.uk/station/buckingham
Chesham                        https://www.ukfirestations.co.uk/station/chesham
Gerrards Cross                 https://www.ukfirestations.co.uk/station/gerrards-cross
Great Missenden                https://www.ukfirestations.co.uk/station/great-missenden
Haddenham                      https://www.ukfirestations.co.uk/station/haddenham
High Wycombe                   https://www.ukfirestations.co.uk/station/high-wycombe
Marlow                         https://www.ukfirestations.co.uk/station/marlow
Newport Pagnell                https://www.ukfirestations.co.uk/station/newport-pagnell
Olney                          https://www.ukfirestations.co.uk/station/olney
Princes Risborough             https://www.ukfirestations.co.uk/station/princes-risborough
Stokenchurch                   https://www.ukfirestations.co.uk/station/stokenchurch
Waddesdon                      https://www.ukfirestations.co.uk/station/waddesdon
West Ashland                   https://www.ukfirestations.co.uk/station/milton-keynes
Winslow                        https://www.ukfirestations.co.uk/station/winslow
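
To see how the selector behaves without requesting the live site, here is a minimal offline sketch. The markup in SAMPLE_HTML is a hypothetical simplification (the real page may be structured differently), and the helper name current_stations and the station "Old Luton" are made up for illustration. It assumes both station rows are siblings of the two h2 headings, which is what the ~ combinator relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may differ.
SAMPLE_HTML = """
<div id="stations-grid">
  <h2>Current Stations</h2>
  <div class="stations-row">
    <div class="station"><a href="/station/ampthill"><span class="station-name">Ampthill</span></a></div>
    <div class="station"><a href="/station/bedford"><span class="station-name">Bedford</span></a></div>
  </div>
  <div class="clear"></div>
  <h2>Old Stations</h2>
  <div class="stations-row">
    <div class="station"><a href="/station/old-luton"><span class="station-name">Old Luton</span></a></div>
  </div>
  <div class="clear"></div>
</div>
"""

# Stations in rows that follow the "Current Stations" heading,
# minus stations in rows that follow the "Old Stations" heading.
SELECTOR = (
    'h2:-soup-contains("Current Stations") ~ .stations-row .station'
    ':not(h2:-soup-contains("Old Stations") ~ .stations-row .station)'
)


def current_stations(html):
    """Return (name, href) pairs for stations matched by SELECTOR."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (s.select_one(".station-name").get_text(strip=True), s.a["href"])
        for s in soup.select(SELECTOR)
    ]


for name, href in current_stations(SAMPLE_HTML):
    print("{:<30} {}".format(name, href))
```

The first half of the selector alone would match every station, because the "old" row is also a later sibling of the "Current Stations" heading; the :not(...) part is what removes it.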

Or use .find_previous() to check whether you're in the correct section:

for link in hhref:
    print(f"{link=}")
    print()
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")

    for s in soup.select(".station"):
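        # Skip stations whose nearest preceding <h2> is missing
        # or is not the "Current Stations" heading
        h2 = s.find_previous("h2")
        if h2 is None or h2.get_text(strip=True) != "Current Stations":
            continue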

        print(
            "{:<30} {}".format(s.select_one(".station-name").text, s.a["href"])
        )

    print()
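
If you would rather do literally what the question describes (collect links and stop at the first <div class="clear"></div>), one possible sketch is to walk the grid in document order and break at that element. The markup below is a hypothetical simplification and the function name links_before_first_clear is made up for illustration; it assumes the first div.clear really does come after all of the 'current' links and before any 'old' ones.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified markup; the real page may differ.
SAMPLE_HTML = """
<div id="stations-grid">
  <h2>Current Stations</h2>
  <div class="stations-row">
    <div class="station"><a href="/station/ampthill">Ampthill</a></div>
    <div class="station"><a href="/station/bedford">Bedford</a></div>
  </div>
  <div class="clear"></div>
  <h2>Old Stations</h2>
  <div class="stations-row">
    <div class="station"><a href="/station/old-luton">Old Luton</a></div>
  </div>
  <div class="clear"></div>
</div>
"""


def links_before_first_clear(html):
    """Collect hrefs inside #stations-grid until the first div.clear."""
    soup = BeautifulSoup(html, "html.parser")
    grid = soup.find("div", id="stations-grid")
    hrefs = []
    # .descendants yields nodes in document order, at any nesting depth
    for node in grid.descendants:
        if not isinstance(node, Tag):
            continue  # skip text nodes
        if node.name == "div" and "clear" in node.get("class", []):
            break  # reached the first separator: stop collecting
        if node.name == "a" and node.has_attr("href"):
            hrefs.append(node["href"])
    return hrefs


print(links_before_first_clear(SAMPLE_HTML))
```

This depends on the position of the separator rather than on the heading text, so it may be more fragile if the site reorders its sections.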

Upvotes: 1
