Reputation: 351
I want to extract elements from a webpage up to the point where it reaches this line: <div class="clear"></div>. This line appears twice on the page I am scraping, so I want to extract all the elements before the first occurrence and then break.
For example:
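import requests
from bs4 import BeautifulSoup

hhref = ['https://www.ukfirestations.co.uk/stations/bedfordshire','https://www.ukfirestations.co.uk/stations/buckinghamshire']
headers = {"User-Agent": "Mozilla/5.0"}  # assumed; headers were not shown in the original post
dats = []
for url in hhref:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    # Collects every link in the grid, with no cut-off point
    station = soup.find('div', {'id': 'stations-grid'}).find_all('a')
    for j in station:
        dats.append(j['href'])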
This extracts all the links, including those after <div class="clear"></div>. The webpage splits the stations into 'current' and 'old'. I only want those in the 'current' section, but I am unsure how to tell BeautifulSoup to extract elements only up to a point.
Upvotes: 1
Views: 54
Reputation: 195553
You can use a CSS selector with :not to filter out the unwanted stations:
import requests
from bs4 import BeautifulSoup

hhref = [
    "https://www.ukfirestations.co.uk/stations/bedfordshire",
    "https://www.ukfirestations.co.uk/stations/buckinghamshire",
]

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

for link in hhref:
    print(f"{link=}")
    print()

    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")

    stations = soup.select(
        'h2:-soup-contains("Current Stations") ~ .stations-row .station:not(h2:-soup-contains("Old Stations") ~ .stations-row .station)'
    )

    for s in stations:
        print(
            "{:<30} {}".format(s.select_one(".station-name").text, s.a["href"])
        )
    print()
Prints:
link='https://www.ukfirestations.co.uk/stations/bedfordshire'
Ampthill https://www.ukfirestations.co.uk/station/ampthill
Bedford https://www.ukfirestations.co.uk/station/bedford
Biggleswade https://www.ukfirestations.co.uk/station/biggleswade
Dunstable https://www.ukfirestations.co.uk/station/dunstable
Harrold https://www.ukfirestations.co.uk/station/harrold
Headquarters https://www.ukfirestations.co.uk/station/headquarters-10
Kempston https://www.ukfirestations.co.uk/station/kempston
Leighton Buzzard https://www.ukfirestations.co.uk/station/leighton-buzzard
Luton https://www.ukfirestations.co.uk/station/luton
Potton https://www.ukfirestations.co.uk/station/potton
Sandy https://www.ukfirestations.co.uk/station/sandy
Shefford https://www.ukfirestations.co.uk/station/shefford
Stopsley https://www.ukfirestations.co.uk/station/stopsley
Toddington https://www.ukfirestations.co.uk/station/toddington
Woburn https://www.ukfirestations.co.uk/station/woburn
link='https://www.ukfirestations.co.uk/stations/buckinghamshire'
Amersham https://www.ukfirestations.co.uk/station/amersham
Aylesbury & HQ https://www.ukfirestations.co.uk/station/aylesbury-hq
Beaconsfield https://www.ukfirestations.co.uk/station/beaconsfield
Brill https://www.ukfirestations.co.uk/station/brill
Broughton https://www.ukfirestations.co.uk/station/broughton
Buckingham https://www.ukfirestations.co.uk/station/buckingham
Chesham https://www.ukfirestations.co.uk/station/chesham
Gerrards Cross https://www.ukfirestations.co.uk/station/gerrards-cross
Great Missenden https://www.ukfirestations.co.uk/station/great-missenden
Haddenham https://www.ukfirestations.co.uk/station/haddenham
High Wycombe https://www.ukfirestations.co.uk/station/high-wycombe
Marlow https://www.ukfirestations.co.uk/station/marlow
Newport Pagnell https://www.ukfirestations.co.uk/station/newport-pagnell
Olney https://www.ukfirestations.co.uk/station/olney
Princes Risborough https://www.ukfirestations.co.uk/station/princes-risborough
Stokenchurch https://www.ukfirestations.co.uk/station/stokenchurch
Waddesdon https://www.ukfirestations.co.uk/station/waddesdon
West Ashland https://www.ukfirestations.co.uk/station/milton-keynes
Winslow https://www.ukfirestations.co.uk/station/winslow
Or use .find_previous() to check whether you're in the correct section:
for link in hhref:
    print(f"{link=}")
    print()

    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")

    for s in soup.select(".station"):
        # Nearest preceding <h2> names the section this station belongs to
        h2 = s.find_previous("h2")
        if h2 is None or h2.get_text(strip=True) != "Current Stations":
            continue
        print(
            "{:<30} {}".format(s.select_one(".station-name").text, s.a["href"])
        )
    print()
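If you would rather do what the question literally describes (walk the grid and stop at the first <div class="clear"></div>), a minimal sketch could look like the one below. The SAMPLE_HTML markup and the links_before_first_clear helper are assumptions made up for illustration; the real page may nest its elements differently, so treat this as a starting point rather than a drop-in solution.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the selectors used above;
# the real page may be structured differently.
SAMPLE_HTML = """
<div id="stations-grid">
  <h2>Current Stations</h2>
  <div class="stations-row">
    <div class="station"><a href="/station/ampthill"><span class="station-name">Ampthill</span></a></div>
    <div class="station"><a href="/station/bedford"><span class="station-name">Bedford</span></a></div>
  </div>
  <div class="clear"></div>
  <h2>Old Stations</h2>
  <div class="stations-row">
    <div class="station"><a href="/station/old-luton"><span class="station-name">Old Luton</span></a></div>
  </div>
  <div class="clear"></div>
</div>
"""


def links_before_first_clear(html):
    """Collect hrefs inside #stations-grid until the first div.clear is met."""
    soup = BeautifulSoup(html, "html.parser")
    grid = soup.find("div", id="stations-grid")
    hrefs = []
    # find_all() yields matching descendants in document order, so the
    # first div.clear encountered is the one that ends the first section.
    for tag in grid.find_all(["a", "div"]):
        if tag.name == "div" and "clear" in tag.get("class", []):
            break
        if tag.name == "a" and tag.has_attr("href"):
            hrefs.append(tag["href"])
    return hrefs


print(links_before_first_clear(SAMPLE_HTML))
```

On the live pages you would pass r.content instead of SAMPLE_HTML. This relies on the marker div sitting between the two sections in document order, which is what the question describes but is not verified here.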
Upvotes: 1