Reputation: 103
I need to download all the files from this page that have "Auction of" in their titles. This is the source for one of the files, for example:
<a href="/media/17527/pr090621b.pdf" aria-label="Auction of £2,500 million of 0 5/8% Treasury Gilt 2035, published 09 June 2021">Auction of £2,500 million of 0 5/8% Treasury Gilt 2035</a>
I am trying to adapt some code I found in another question, but the pages are coming back empty:
import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, url, destination_path = task
    response = session.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

if __name__ == "__main__":
    destination_path = "pgns"
    max_workers = 8

    if not os.path.exists(destination_path):
        os.makedirs(destination_path)

    with requests.Session() as session:
        response = session.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*Auction of\?.*"))
        tasks = [
            (session, host + page.get("href"), destination_path)
            for page in pages
        ]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(download_pgn, tasks)
Upvotes: 0
Views: 367
Reputation: 20018
The find_all() method accepts a function. You can create a lambda function to filter for all a tags whose text contains "Auction of":
for tag in soup.find_all(lambda t: t.name == "a" and "Auction of" in t.text):
    print(tag.text)
Or, you can use an [attribute*=value] CSS selector:
# Find all `a` tags whose `aria-label` attribute contains `Auction of`
for tag in soup.select("a[aria-label*='Auction of']"):
    print(tag.text)
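Note that on this page either filter should work: as the example anchor in the question shows, the "Auction of" prefix appears both in the link text and in the aria-label attribute. The lambda matches against the tag's text, while the CSS selector matches against the attribute.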
Upvotes: 0
Reputation: 50190
Check your regular expression syntax. The regex r".*Auction of\?.*" will only match titles that contain a literal "of?", since \? matches a question mark character rather than making anything optional.
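To see this, a quick check with the title from the question:

import re

pattern = re.compile(r".*Auction of\?.*")
title = "Auction of £2,500 million of 0 5/8% Treasury Gilt 2035"
print(pattern.search(title))          # None: no literal "?" after "of"
print(pattern.search("Auction of?"))  # matches, because of the literal "?"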
But the href= parameter matches against the URL in the link, not its text, so that won't help you much either. This will find the links with the matching titles:
links = soup.find_all("a", string=re.compile(r"Auction of\b"))
And this will extract their URLs so you can retrieve them:
[ file["href"] for file in links ]
Upvotes: 1
Reputation: 103
This is what ended up working for me:
from bs4 import BeautifulSoup
import requests
import re

links = []
url = 'https://www.dmo.gov.uk/publications/?offset=0&itemsPerPage=1000000000&parentFilter=1433&childFilter=1433|1450&startMonth=1&startYear=2000&endMonth=6&endYear=2021'

req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")

# Collect the hrefs of all links whose aria-label starts with "Auction of"
for a in soup.find_all("a", {"aria-label": re.compile(r"^Auction of\b")}, href=True):
    links.append(a['href'])

def download_file(url):
    # Use the last path segment (minus any query string) as the local filename
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)

host = 'https://www.dmo.gov.uk/'
for link in links:
    url = host + link
    download_file(url)
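One caveat: the hrefs already start with a slash, so host + link yields a double slash (https://www.dmo.gov.uk//media/...), which this server evidently tolerates. urllib.parse.urljoin sidesteps that:

from urllib.parse import urljoin

for link in links:
    download_file(urljoin(host, link))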
Upvotes: 0